This study was nested within an assessor- and participant-blinded randomized controlled trial comparing corticosteroid injection with placebo given 2 weeks prior to 12 weeks of supervised exercise in people with knee OA. Inclusion criteria for the trial were: age 40 years or older; symptomatic, radiologically verified knee OA; ‘pain while walking on a flat surface’ of at least 4 on a 0–10 numeric rating scale (NRS); and a body mass index of at least 20 but less than 35 kg/m². Exclusion criteria included use of intra-articular corticosteroids in the knee or participation in physiotherapeutic exercise for knee OA within the last 3 months, and severe concomitant diseases. All participants in this study gave informed consent before enrolling in the hosting trial, and each received a copy of the consent form.
Dynamic weight-bearing Assessment of Pain (DAP)
The DAP is a simple performance test with an integrated pain score, designed to provide useful information for monitoring treatment progress and evaluating treatment effects in knee OA. The patient is asked to perform as many standing knee-bends as possible within 30 s. For each knee-bend, the knees should reach close to 90° of flexion and then full extension. The movement is supervised by the rater. The test score is the worst pain intensity experienced during the knee-bends, self-reported on a 0–10 NRS immediately after the 30 s. Thus, the pain intensity score is an assessment of pain during performance of a specific weight-bearing activity. The DAP takes 1–2 min to perform, including verbal instructions, and requires no equipment besides a stopwatch or watch. The DAP was applied at baseline and at the end-of-treatment visit.
Knee injury and Osteoarthritis Outcome Score (KOOS)
The KOOS is a questionnaire for assessing patient-reported symptoms. It consists of 5 subscales assessing different constructs: Symptoms (7 items), Pain (9 items), Function in daily living (16 items), Function in sports and recreation (5 items), and knee-related Quality of Life (QoL) (4 items). Responses are given using Likert boxes, and each question is assigned a score from 0 to 4. A normalized score is calculated for each subscale, ranging from 0 (extreme symptoms) to 100 (no symptoms). Reliability of KOOS in an OA population has been reported as acceptable. The KOOS questionnaire was applied at baseline and at the end-of-treatment visit.
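The normalization of a KOOS subscale can be illustrated with a short sketch. It assumes the standard KOOS scoring rule, 100 − (mean item score × 100) / 4, which the text above does not spell out; the function name and the example responses are hypothetical, and missing-item rules are not handled.

```python
def koos_subscale_score(item_scores):
    """Normalize KOOS item responses (each scored 0-4) to a 0-100 subscale score.

    Assumes the standard KOOS rule: 100 - (mean item score * 100) / 4,
    so 0 means extreme symptoms and 100 means no symptoms.
    """
    if not item_scores:
        raise ValueError("at least one item response is required")
    if any(not 0 <= s <= 4 for s in item_scores):
        raise ValueError("each item must be scored from 0 to 4")
    mean_item = sum(item_scores) / len(item_scores)
    return 100 - mean_item * 100 / 4


# Hypothetical responses to the 9 items of the Pain subscale
pain_items = [2, 1, 3, 2, 2, 1, 0, 2, 1]
print(round(koos_subscale_score(pain_items), 1))
```

Because higher KOOS scores indicate fewer symptoms while higher NRS scores indicate more pain, the two scales run in opposite directions.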
Six-minute walk test (6MWT)
The 6MWT is a walking test measuring the total distance walked in 6 min. The distance is a surrogate measure of functional capacity and cardiovascular function, originally used for patients with heart and lung diseases. The 6MWT has shown acceptable test-retest reliability and responsiveness in knee OA populations [11, 12]; it was applied at baseline and at the end-of-treatment visit.
Pain after 6MWT (6MWTpain)
Pain rating on a 0–10 NRS immediately after the 6MWT (6MWTpain) is not a standardized test. However, similar ratings of pain subsequent to a performance test have been applied in other studies [5, 6]. The 6MWTpain was included to compare the DAP with this other assessment of pain during a performance test. The 6MWTpain was applied at baseline and at the end-of-treatment visit.
Transition questionnaire (TRANS-Q)
Transition ratings, or Global Perceived Effect scales, are recommended as a core outcome measure in chronic pain trials and have been used as an external criterion to determine responsiveness or Minimal Important Change (MIC) of other measurement instruments. A transition questionnaire (TRANS-Q), modified from Jaschke et al., was used to ask the participants about their experienced change in pain after the intervention with the question: “Did your knee pain change since you entered this project?” Response options were: “It is unchanged,” “It is better,” and “It is worse.” The “It is unchanged” response is given a score of 0, and no further questions are asked. The responses “It is better” and “It is worse” each bring up a seven-point scale, scored from +1 to +7 and from −1 to −7, respectively, so that the total score spans from −7 (worst) to +7 (best). For the purpose of this study, a clinically important change in pain was defined as a TRANS-Q score of at least 2 in either direction (+2: a little better; −2: a little worse). No change was defined as a TRANS-Q score of 0 (no change) or +1/−1 (almost the same, hardly any better/worse at all). The transition questionnaire was implemented in the hosting trial after the trial had commenced and was therefore available for only a subset of the trial participants. It was administered only at the end-of-treatment visit.
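The grouping of TRANS-Q scores described above can be sketched as follows. This is a minimal illustration, not the study's code; the function name and the group labels ("improved", "worsened", "no change") are hypothetical.

```python
def classify_transq(score):
    """Group a TRANS-Q score (-7 to +7) using the definitions in the text.

    Scores of 0 and +/-1 count as no change; scores of +/-2 or beyond count
    as a clinically important change. The labels are illustrative only.
    """
    if not isinstance(score, int) or not -7 <= score <= 7:
        raise ValueError("TRANS-Q score must be an integer from -7 to +7")
    if abs(score) <= 1:
        return "no change"
    return "improved" if score > 0 else "worsened"


for s in (-3, -1, 0, 1, 2):
    print(s, classify_transq(s))
```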
Kellgren & Lawrence
The Kellgren & Lawrence grading scale is used to assess the radiographic severity of knee OA based on the following radiographic features: osteophytes, periarticular ossicles, narrowing of joint cartilage, sclerotic tissue, and altered shape of bone ends. The scores are: 0 (no x-ray changes of OA), 1 (doubtful presence of OA), 2 (minimal presence of OA), 3 (moderate presence of OA), and 4 (severe presence of OA). X-rays were taken at baseline and at the end-of-treatment visit.
The Ahlbäck classification of radiographic knee OA of the tibiofemoral joint also assesses radiographic severity of OA and has five grades: 1 (joint space narrowing, <3 mm), 2 (joint space obliteration), 3 (minor bone attrition, 0–5 mm), 4 (moderate bone attrition, 5–10 mm), and 5 (severe bone attrition, >10 mm). X-rays were taken at baseline and at the end-of-treatment visit.
Analyses involving hypothesis testing for validity and responsiveness (i.e., the validity of a change score) and determination of the Minimal Important Change (MIC) to interpret change scores of the DAP were conducted in adherence to the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology [3, 18].
For validation studies, a minimum sample size of 50 is recommended, but larger samples are preferred. There is currently no consensus on standards for determining sample size in MIC studies. The statistical analyses were performed using SAS statistical software (version 9.3; SAS Institute Inc., Cary, NC, USA) and follow the COSMIN standards. As no gold standard exists for the construct ‘pain during activity,’ validity and responsiveness were evaluated through hypothesis testing. The construct validity of the DAP was evaluated by Spearman correlation coefficients with the other outcome instruments using baseline scores. Likewise, the responsiveness of the DAP was estimated by Spearman correlation coefficients with the other outcome instruments using change scores (baseline to end-of-treatment). There is no consensus about the magnitude of correlation required for acceptable convergent or divergent validity [3, 19], that is, for concluding that the two instruments being compared assess similar or different constructs, respectively. As the DAP was expected to assess a composite construct containing aspects of the constructs assessed by the other instruments, some correlation was expected. Thus, relatively high correlation criteria were applied for both validity and responsiveness in this study: r > 0.7 for convergence and r < 0.7 for divergence, based on the common application of 0.7 as cutoff. Correlations below 0.2 were disregarded, as this is the critical value for a two-tailed 0.05 level of significance in a sample of n = 100 (the sample of the hosting trial).
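The correlation criteria above can be illustrated with a small sketch. The study's analyses were run in SAS; the Python below is only an illustration, and all function names are hypothetical. It assumes that the criteria are applied to the absolute value of the coefficient (the text does not say how negative correlations were handled) and it treats a coefficient of exactly 0.7, which the stated criteria leave open, as not convergent. The last lines check the 0.2 threshold with the usual t-based approximation, r = t / sqrt(t² + n − 2), using t ≈ 1.984 for 98 degrees of freedom.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the tie-averaged ranks.

    Raises ZeroDivisionError if either variable is constant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def classify_correlation(r, disregard_below=0.2, cutoff=0.7):
    """Apply the 0.2 and 0.7 thresholds to |r| (absolute value is an assumption)."""
    magnitude = abs(r)
    if magnitude < disregard_below:
        return "disregarded"
    return "convergent" if magnitude > cutoff else "divergent"


# Approximate critical r for a two-tailed 0.05 test with n = 100 (df = 98)
t_crit = 1.984
critical_r = t_crit / sqrt(t_crit ** 2 + 98)
print(round(critical_r, 3))
```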
Both the DAP and the 6MWTpain were expected to assess a construct of pain during activity (albeit two different activities). Thus, for both baseline and change scores, the DAP score correlations with the 6MWTpain score were hypothesized to be convergent. The 6MWT reflects a construct of physical capacity/cardiovascular function, whereas the subscales of the KOOS assess symptoms, pain, function in daily living, function in sports and recreation, and knee-related quality of life. Thus, for both baseline and change scores, the DAP score correlations with the 6MWT and the KOOS subscale scores were hypothesized to show divergence.
Responsiveness was further evaluated by patient-reported change (TRANS-Q), hypothesizing that the group that had experienced a change in pain would have a greater mean change in DAP scores than the group reporting no change in pain. Patient-reported change in pain (TRANS-Q) was also used as an external criterion to interpret the DAP change scores in terms of the MIC, which we defined as the optimal cutoff point on a Receiver Operating Characteristic (ROC) curve, i.e., the value for which the sum of misclassifications ([1 – sensitivity] + [1 – specificity]) is smallest. The 95% limit cutoff point was calculated as the mean change + 1.645 × the SD of change in the group of participants who reported no change.
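Both cutoff definitions can be sketched in a few lines. This is an illustration under stated assumptions, not the study's SAS code: the function names and the example data are hypothetical, change scores are assumed to be oriented so that larger values mean more improvement, a participant is counted as test-positive when the change is at or above the cutoff, and the SD is taken as the sample SD (n − 1), which the text does not specify.

```python
from statistics import mean, stdev


def roc_mic(changes, important_change):
    """Return the cutoff that minimizes (1 - sensitivity) + (1 - specificity).

    changes: change scores, larger = more improvement (assumed orientation)
    important_change: parallel booleans from the external criterion
    Ties between cutoffs are resolved in favor of the smallest cutoff.
    """
    pos = [c for c, flag in zip(changes, important_change) if flag]
    neg = [c for c, flag in zip(changes, important_change) if not flag]
    if not pos or not neg:
        raise ValueError("both criterion groups must contain participants")
    best_cutoff, best_error = None, float("inf")
    for cutoff in sorted(set(changes)):
        sensitivity = sum(c >= cutoff for c in pos) / len(pos)
        specificity = sum(c < cutoff for c in neg) / len(neg)
        error = (1 - sensitivity) + (1 - specificity)
        if error < best_error:
            best_cutoff, best_error = cutoff, error
    return best_cutoff


def limit_95(no_change_scores):
    """95% limit: mean change + 1.645 * SD of change in the no-change group."""
    return mean(no_change_scores) + 1.645 * stdev(no_change_scores)


# Hypothetical change scores and criterion groups
changes = [0, 1, 0, 1, 2, 3, 2, 4]
important = [False, False, False, False, True, True, True, True]
print(roc_mic(changes, important))
print(round(limit_95([0, 1, 0, 1]), 2))
```

The ROC-based value depends on both criterion groups, whereas the 95% limit uses only the distribution of change in participants who reported no change, so the two cutoffs need not coincide.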