Development and validation of a new patient-reported outcome measure for patients with pressure ulcers: the PU-QOL instrument

Background Patient-reported outcome (PRO) data are integral to patient care, policy decision making and healthcare delivery. PRO assessment in pressure ulcers is in its infancy, with few studies including PROs as study outcomes. Further, there are no pressure ulcer PRO instruments available. Methods We used gold-standard methods to develop and evaluate a new PRO instrument for people with pressure ulcers (the PU-QOL instrument). First, a conceptual framework was developed, forming the basis of the PU-QOL scales. Next, an exhaustive item pool was used to produce a draft instrument that was pretested using mixed methods (cognitive interviews and Rasch Measurement Theory). Finally, we undertook psychometric evaluation in two parts. The first part was item reduction, using PU-QOL data from 227 patients. The second part was reliability and validity evaluation of the item-reduced version using both traditional and Rasch methods, on PU-QOL data from 229 patients. Results The final PU-QOL contains 10 scales for measuring symptoms, physical functioning, psychological well-being and social participation specific to pressure ulcers. It is intended for administration, with patients rating the amount of “bother” attributed during the past week on a 3-point response scale. Scale scores are generated by summing items, with lower scores indicating better outcome. The PU-QOL instrument was found to be acceptable, reliable (Cronbach’s alpha values ranging 0.89 - 0.97) and valid (hypothesised correlations between PU-QOL and SF-12 scores (r >0.30) and between PU-QOL scales and sociodemographic variables (r <0.30) were consistent with predictions). Conclusions The PU-QOL instrument provides a standardised method for assessing PROs, reflecting the domains in a pressure ulcer-specific conceptual framework. It is intended for evaluating patient-orientated differences between interventions and, in particular, their impact from the perspective of patients.


Background
Chronic wounds are a major health problem and challenge to patients, healthcare professionals and healthcare systems. Pressure ulcers (PUs) are chronic wounds that occur as localised injury to the skin and/or underlying tissue usually over a bony prominence, as a result of pressure, or pressure in combination with shear [1]. They range in size and severity of tissue layer affected, with particularly vulnerable areas being the sacrum, buttocks and heels [2]. With widespread prevalence and incidence in all health settings [3], PUs, often a complication of serious acute or chronic illness, are a health problem associated with increased morbidity [4], mortality [5], healthcare costs and hospitalisation, and identified as a UK National Health Service (NHS) quality indicator [6].
Both PUs themselves and interventions for preventing and treating PUs impact health-related quality of life (HRQL) and can severely compromise all areas of patient functioning [7,8]. Clinical outcomes associated with PU prevention or healing, such as incidence or rate of healing, have been the focus of clinical inquiry; however, due to advances in health outcome measurement, such information alone is no longer sufficient to support progress being made in the PU field [9]. Cochrane reviews highlight the lack of robust evidence for the clinical effectiveness of a majority of PU treatments [10]; resource availability is not based upon health economic evaluation and there is no systematic way of considering patients' priorities for interventions. Therefore, clinical decision making continues despite being uninformed by high quality studies based on cost-effectiveness and patients' perspectives.
The field of health is reliant on health outcome measurement to provide a strong evidence-base, incorporating both patient perspectives and cost analyses. In health outcomes research, evaluation of intervention-related outcomes is often undertaken with the help of rating scales, more recently called patient-reported outcome (PRO) instruments. PRO instruments are increasingly used in clinical studies for measuring outcome variables. In this role, instruments are the central dependent variables on which treatment decisions are made. They can be useful tools for evaluating health changes following interventions if they are fit for purpose and accord with international standards for rigorous measurement [9,11].
Patient-based outcome measurement in PUs is in its infancy; few studies have measured PROs, and those that have done so have used generic instruments [12]. A PRO instrument specific to PUs could help improve the evidence-base through research assessing effectiveness of PU therapies; facilitate clinician-patient communication and shared decision making; prioritise patient problems and preferences; monitor changes or outcomes of treatment; measure the performance of healthcare providers and services; and be used in clinical audit [13-15].
Our previous work has identified PROs important to people with PUs [7,16,17], established the need for a patient-reported measure of outcomes specific to PUs [12], and developed a provisional version of such a measure (the PU-QOL instrument). The PU-QOL instrument was developed on the basis of a PU-specific HRQL conceptual framework [16] and existing PU and HRQL literature [7]. These sources provided insight into variables important for measurement from the perspective of patients with PUs and were used to generate an exhaustive list of items. The item list (n=122) was transformed into scales intended to define coherent, clinically meaningful constructs, each consisting of items representing aspects of the continuum of that construct and reflecting the domains within our conceptual framework. This produced a preliminary PU-QOL version which was pre-tested through cognitive interviews with 35 patients with PUs [18], producing a provisional PU-QOL version. Pre-testing identified potential strengths and weaknesses of PU-QOL items, guided decision-making about modifications to items (content and response options) and questionnaire design, and provided early evidence for validity and clinical utility of each PU-QOL scale as reflected by clinically meaningful hierarchical scales, prior to formal psychometric evaluation.
The aim of this study was to provide researchers and clinicians with a comprehensive evaluation of some of the fundamental psychometric measurement properties of the provisional PU-QOL instrument.

Methods
We followed international PRO guidelines [9,19-21] for the development and validation of the PU-QOL instrument (Figure 1). Collaboration was sought from members of the European Pressure Ulcer Advisory Panel (EPUAP) and from 29 acute and primary care NHS organisations around the UK. A UK NHS Research Ethics Committee provided ethical approval and all participants gave written informed consent to participation.

Field test one design
Sample
The first field test was undertaken to construct PU-QOL scales and perform a preliminary psychometric evaluation in a large sample of patients with PUs. Patients from acute and community NHS Trusts around England and Scotland were included if they were aged ≥18 years, had an existing PU of any category, location or duration, and were able to provide informed consent to participate. Patients were excluded if they had only moisture lesions, were unconscious, confused, cognitively impaired or deemed ethically inappropriate to approach, did not speak or understand English, or were unable to provide informed consent.
Eligible patients were purposively sampled, ensuring balanced representation across PU categories (superficial, severe) and skin sites (torso, limb), setting (acute, community), age (<70 years, ≥70) and gender. The 'rule of thumb' sample size recommendation for psychometric analyses of new summated scales is five to 10 subjects per item, to reduce the effect of chance [19,22]. Following this recommendation, if the longest potential summated scale was taken (pain containing 11 items), then a 110 patient sample would be required. For the Rasch analysis, a sample of around 250 patients would allow sample selection across the full measurement range; membership to five class interval groups of around 50 patients in each group is suggested [23,24].
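The rule-of-thumb arithmetic above can be sketched in a few lines. This is only an illustration of the five to 10 subjects per item recommendation applied to the 11-item pain scale; the function name and parameters are hypothetical and not part of the study's analysis.

```python
def sample_size_range(n_items, per_item_low=5, per_item_high=10):
    """Return the (lower, upper) rule-of-thumb sample size for a summated scale.

    Assumes the 'five to 10 subjects per item' recommendation quoted in the text.
    """
    return n_items * per_item_low, n_items * per_item_high


# Longest candidate scale in the text: pain, with 11 items.
low, high = sample_size_range(11)
print(low, high)  # 55 110
```

The upper bound (110) is the figure quoted in the text for the pain scale.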

Rasch analysis
A preliminary psychometric evaluation was performed using both traditional psychometrics in line with proposed US Food and Drug Administration (FDA) criteria [9] and new psychometric methods, Rasch Measurement Theory (RMT) [25]. RMT is increasingly used in the development of PRO instruments [26,27] as it provides a formal method for evaluating scale functioning against a sophisticated mathematical measurement model, the Rasch model [25].
A Rasch analysis, using the Andrich Rating Scale Model [28], was performed using RUMM2030 [29]. The following properties of the provisional PU-QOL version were examined: mode of administration (patient self-completed or researcher administered; data will be published separately), scale targeting, item response categories, item series (e.g. item-fit) and response bias, to guide scale construction and identify items with poor psychometric properties for possible elimination. PU-QOL data were tested against model expectations and any deviations were examined to determine whether scales could be improved. Final decisions on item inclusion/exclusion were made according to appraisals of the analyses against measurement criteria (Table 1) and clinical relevance (the extent to which items within proposed scales are clinically cohesive), as opposed to examinations carried out singularly or sequentially.
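For readers unfamiliar with the rating scale model, the following minimal sketch shows how category probabilities are obtained from a person location, an item location and a set of thresholds shared across items. It is a textbook formulation written for illustration, not the RUMM2030 implementation; the function name and the numeric values are hypothetical.

```python
import math


def rsm_category_probs(theta, delta, taus):
    """Category probabilities under the Andrich rating scale model.

    theta: person location (logits)
    delta: item location (logits)
    taus:  threshold parameters shared by all items in the scale;
           len(taus) + 1 is the number of response categories.
    """
    # Category k has exponent sum_{j<=k} (theta - delta - tau_j); category 0 has 0.
    exponents = [0.0]
    for tau in taus:
        exponents.append(exponents[-1] + (theta - delta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-category item (two thresholds), person slightly above the item.
probs = rsm_category_probs(theta=0.5, delta=0.0, taus=[-1.0, 1.0])
print([round(p, 3) for p in probs])
```

When estimated thresholds are not in increasing order, adjacent categories are not functioning as intended, which is the kind of disordering the analysis looks for.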

Traditional analysis
The 10 Rasch constructed scales underwent a preliminary psychometric evaluation using traditional psychometric tests [9,11,22] for: acceptability, scaling assumptions, reliability, and validity. SPSS 15.0 software was used for these analyses. Psychometric tests and criteria are summarised in Table 1.

Field test two design
Sample
The second field test was undertaken to perform a comprehensive psychometric evaluation of the final (10 scale/83-item) PU-QOL in a large independent sample of patients with PUs (eligibility criteria and methods as for field test one). A sample of around 250 patients would provide sufficient participants to estimate test-retest reliability; correlations at levels expected in test-retest situations (e.g. r ≥ 0.80) can be estimated with reasonable precision (95% confidence intervals of ±0.1) with relatively few subjects [46,47].
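The precision argument can be illustrated with the standard Fisher z approximation for the confidence interval of a Pearson correlation. This is a sketch under the assumption that the cited sources rely on a comparable approximation; the function name is hypothetical.

```python
import math


def pearson_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r via Fisher's z.

    Assumes bivariate normality; n is the number of paired observations.
    """
    z = math.atanh(r)              # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r = 0.80 with the planned sample of about 250 patients.
lower, upper = pearson_ci(0.80, 250)
print(round(lower, 2), round(upper, 2))  # roughly 0.75 and 0.84
```

With n = 250 the interval is narrower than ±0.1, consistent with the statement that such correlations can be estimated precisely with relatively few subjects.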
Figure 1 Steps towards developing and evaluating the PU-QOL instrument.
• Literature review: generated broad HRQL domains/important PU-specific PROs identified
• Clinical expert opinion: produced a working conceptual framework
• Qualitative interviews (n=30) with people with PUs: revised conceptual framework
• Clinical and methodological expert review: produced a conceptual framework of HRQL outcomes
Item Generation
Pre-testing
• Semi-structured cognitive interviews with people with PUs (n=35): identified problems with item content (ambiguous, confusing), layout and administration
• Revised instrument based on patient recommendations
• Clinical and methodological expert review: produced a first draft PU-QOL instrument
5a. Field Test 1; 5b. Mode of Administration Study
• Item analysis and scale construction (n=227)
• Rasch analysis followed by traditional psychometric tests for reliability and validity
• Sub-sample (n=75) randomised to self-complete or administered groups and differential item functioning performed to determine the best administration mode for this population
• Clinical and methodological expert review: considered analyses in combination and clinical relevance; modified first draft PU-QOL
6. Field Test 2
• Final psychometric analysis (n=229)
• Rasch analysis followed by traditional psychometric tests for reliability and validity
• Clinical and methodological expert review: produced final version PU-QOL

Table 1 Psychometric tests and criteria used in the evaluation of the PU-QOL instrument
Columns: Psychometric property; Traditional methods (test and criteria); Rasch methods (test and criteria)

Acceptability and data quality - completeness of item- and scale-level data.
• Score distributions (floor/ceiling effects and skew of scale scores)
• Even distribution of endorsement frequencies across response categories (>80%)
• % of item-level missing data (<10%) [30]
• Low number of persons at extreme (i.e. floor/ceiling) ends of the measurement continuum
• % of computable scale scores (>50% completed items) [31]
• Items in scales rated 'not relevant' <35%

Scaling assumptions - legitimacy of summing a set of items (items should measure a common underlying construct).
• Similar item mean scores [32] and SDs [33]
• Positive residual r between items (<0.30)
• High negative residual r (>0.60) suggests redundancy
• Items have similar ITCs [34]
• Items sharing common variance suggests uni-dimensionality
• Items do not measure at the same point on the scale
• Evenly spaced items spanning whole measurement range

Item response categories - categories in a logical hierarchy.
• NA
• Ordered set of response thresholds for each scale item

Targeting - extent to which the range of the variable measured by the scale matches the range of that variable in the study sample.
• Scale scores spanning entire scale range
• Person-item threshold distribution: person locations should be covered by items and item locations covered by persons when both calibrated on the same metric scale [35]
• Floor and ceiling (proportion sample at minimum and maximum scale score) effects should be low (<15%) [36]
• Skewness statistics should range from −1 to +1 [37]
• Good targeting demonstrated by the mean location of items and persons around zero
• No published criteria for item level targeting

Reliability
Internal consistency - extent to which items comprising a scale measure the same construct (e.g. homogeneity of the scale).
• Cronbach's alphas for summary scores (adequate scale internal consistency is ≥0.70) [22]
• High person separation index >0.7 [38]; quantifies how reliably person measurements are separated by items
• Item-total r between +0.4 and +0.6 indicate items are moderately correlated with scale scores; higher values indicate well correlated items with scale scores [22]
• Power-of-tests indicate the power in detecting the extent to which the data do not fit the model [24]
• Items with ordered thresholds
*Test-retest reliability - stability of a measuring instrument.
• Intra-class r coefficient >0.70 between test and retest scores [11]
• Statistical stability across time points (no uniform or non-uniform item DIF (p=>0.05 or Bonferroni adjusted value))
• Pearson r: >0.7 indicates reliable scale stability

Validity
• Involves accumulating evidence from different forms
Content validity - extent to which the content (items) of a scale is representative of the conceptual construct it is intended to measure.
• Consideration of item sufficiency and the target population
• Clearly defined construct
• Qualitative evidence from individuals for whom the measure is targeted, expert opinion and literature review (e.g. theoretical and/or conceptual definitions) [9]
• Person fit residuals within given range +/−2.5

Rasch analysis
A Rasch analysis was performed on all 10 PU-QOL scales.
In addition to the properties examined in field test one, differential item functioning (DIF) was also assessed. DIF occurs when people from different groups (e.g. gender) with the same latent trait (e.g. pain) have a different probability of giving a certain response to an item [44]. Groups to be studied were selected based on theoretical considerations about whether or not the construct measured by each PU-QOL scale was hypothesised to have the same conceptual meaning across groups.

Traditional analysis
The final PU-QOL version underwent traditional psychometric analyses as described for field test one. Additional tests for reliability (test-retest) and validity, including both within- and between-scales testing (convergent, discriminant, known groups), were undertaken (Table 1). To minimise respondent burden, the SF-12v2 Acute, English (UK) version was used [48] to examine convergent validity.

Results
Table 1 Psychometric tests and criteria used in the evaluation of the PU-QOL instrument (Continued)

Measurement continuum - extent to which scale items mark out the construct as a continuum on which people can be measured.
• NA
• Individual scale items located across a continuum in the same way locations of people are spread across the continuum [26]
• Items spread evenly over a reasonable measurement range [40,41]. Items with similar locations may indicate item redundancy

Response dependency - response to one item determines response to another.
• NA
• Response dependency is indicated by residual r >0.3 for pairs of items [40,41]

ii) Between scale analysis
Criterion validity - hypotheses based on criterion or 'gold standard' measure.
• There are no true gold standard HRQL [42], PU-specific or chronic wound-specific measures available [12]
• NA
• NA

*Known groups differences - ability of a scale to differentiate known groups.
•^Generate hypotheses (based on subgroups known to differ on construct measured) and compare mean scores (e.g. predict a stepwise change in PU-QOL scale scores across 3 PU severity groups and that mean scores would be significantly different)
• Hypothesis testing (e.g. clinical questions are formulated and the empirical testing comes from whether or not data fit the Rasch model)
• Statistically significant differences in mean scores (ANOVA)

*Differential item functioning (item bias) - the extent of any conditional relationships between item response and group membership.
• NA
• Persons with similar ability should respond in similar ways to individual items regardless of group membership (e.g. age) [44]
• Uniform DIF - uniformity amongst differences between groups
• Non-Uniform DIF - non-uniformity amongst differences between groups; can be considered at 1% (Bonferroni adjusted) and 5% CIs

Field-test one: scale construction and preliminary psychometric evaluation
Sample
The first field test screened 989 patients from 21 hospitals, 10 community services and one hospice. Of those screened, eligibility was assessed for 787 (79.6%); 416 were considered eligible (52.9%); and of those eligible, 287 (69.0%) consented to participate; however, 60 were excluded from analysis as they self-completed the PU-QOL (data on the self-completed sample will be published elsewhere). Cognitive impairment was the main reason for ineligibility (38.8%). Table 2 presents the sample characteristics.
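The recruitment figures reported for field test one can be checked with simple arithmetic. The sketch below recomputes each percentage from the preceding stage and the final analysis sample; the function and key names are hypothetical, and the counts are those given in the text.

```python
def recruitment_flow(screened, assessed, eligible, consented, excluded):
    """Recompute stage-to-stage percentages and the analysed sample size."""

    def pct(part, whole):
        return round(100.0 * part / whole, 1)

    return {
        "assessed_pct": pct(assessed, screened),    # of those screened
        "eligible_pct": pct(eligible, assessed),    # of those assessed
        "consented_pct": pct(consented, eligible),  # of those eligible
        "analysed": consented - excluded,
    }


# Field test one counts as reported in the text.
flow = recruitment_flow(screened=989, assessed=787, eligible=416,
                        consented=287, excluded=60)
print(flow)
```

The recomputed values (79.6%, 52.9%, 69.0% and 227 analysed patients) agree with those reported.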

Rasch analysis: item reduction and scale formation
The first psychometric evaluation produced a 10-scale instrument (Table 3). The Rasch analysis detected important limitations of the PU-QOL scales, resulting in modifications. It detected that the four-category item scoring function did not work as intended for multiple items. For those items where the response categories were working as intended, thresholds were close to being disordered; people had difficulty distinguishing between 'a little bother' and 'quite a bit of bother' categories. This provided good evidence that items would benefit from fewer response categories. All scale items were subjected to a post hoc rescoring by collapsing adjacent categories.
Re-analysis demonstrated that all thresholds were now correctly ordered, producing scales with three categories (0 = no bother, 1 = little bother, 2 = a lot of bother).
Targeting between the distribution of person measurements and item locations indicated that the samples were adequate for examining the scales but the scales were suboptimal for measuring the sample. Significant ceiling effects indicated that scales might provide limited information about people at the extremes of the sample distribution (those with least disability/impairment). However, the location ordering of scale items was clinically sensible, providing evidence towards construct validity. Some items had notable criterion failures: fit residuals outside +/−2.5; high chi-squared values with significant p-values; and significantly under- or over-discriminating item characteristic curves (Table 3). Few items exceeded +/−0.3 residual correlations, indicating that item responses were largely independent of each other and that there were no redundant items. Departures from item fit expectation were considered in combination and guided item removal. Person separation index values indicated good to reasonable reliability for scales distinguishing between responders on each scale variable (Table 3).

Traditional analysis
A preliminary psychometric evaluation against traditional psychometric criteria supported the PU-QOL scales as reliable and valid measures of PU symptoms, physical and social functioning, and psychological well-being. Briefly, data quality was high (scale scores were computable for 93-99.6% of respondents) and scaling assumptions were satisfied (similar mean item scores; corrected item-total correlations ranged 0.53-0.92). Scale-to-sample targeting was good (scale scores spanned the scale range but were notably skewed for three scales (values outside +/−1.0); mean scores were near scale mid-points for 67% of scales; Table 3). The item-total correlations, alpha coefficient and homogeneity coefficient (inter-item correlation mean and range; Table 3) provide evidence towards the internal construct validity of PU-QOL scales.

Field-test two: final psychometric evaluation Sample
The second field test involved a comprehensive psychometric evaluation of the final (10 scale/83-item) PU-QOL, using RMT and traditional psychometric methods. A total of 879 patients were screened of whom eligibility was assessed for 717 (81.6%); 391 were considered eligible (54.5%); and of those eligible, 231 (59.1%) consented to participate; however two were excluded from analysis (one patient died; one patient was recruited twice). Table 2 presents the sample characteristics.

Rasch analysis
The measurement properties of PU-QOL scales were largely supported, as demonstrated through items that mapped out continua of increasing intensity and were located along those continua in a clinically sensible order. Scale items work together to define single variables, albeit with some item misfit and local dependence (Table 4). DIF was demonstrated in three items (e.g. items 'difficulty standing for long periods' and 'limited in ability to go up and down stairs' from the mobility scale; Table 4); however, deviations from model expectations were marginal, suggesting item performance across the four clinical subgroups is stable and that these groups can be measured on a common ruler. The Rasch analysis detected some important limitations: the three-category scoring function did not work as intended for some scale items, indicated by disordered thresholds (e.g. items 'walking slowed' and 'limited in ability to walk' from the mobility scale; items 'regular activities' and 'jobs around the house' from the activity scale), and targeting problems emerged. Inspection of threshold distributions demonstrated sub-optimal targeting of PU-QOL scales to the study sample for most scales (items did not span the full range of the patient sample, indicating that measurement could be improved at the extreme ends of some scales; Table 4). The largest frequency of respondents was often at the ceiling of scale ranges (least bother). Ideally, there should be a good match between the scale and sample ranges, with people falling within the range of the items. As sample sizes were small for some scales (e.g. removing people with no odour bother resulted in a sample of 27 for analysis), it was deemed premature to make major modifications to items and the scoring function without additional empirical evidence.

Traditional analysis
The traditional psychometric evaluation supported the PU-QOL scales as reliable and valid measures of PU symptoms, physical and social functioning, and psychological well-being. Total scores could be computed for most people (computable scale scores ranged 95.6-99.6%), implying good data quality. Scaling assumptions were satisfied (corrected item-total correlations ranged 0.51-0.94). All item-own-scale correlations were high (corrected item-total correlations ranged 0.525-0.920; Table 5) and satisfied recommended criteria (>0.3), thus providing support that items within scales measured a common underlying construct. Corrected item-total correlations >0.3 indicated that items within scales contained a similar proportion of information. Scale-to-sample targeting was reasonable: scale scores spanned the scale ranges but were notably skewed for the exudate, odour and self-consciousness scales (values outside +/−1.0); mean scores were near scale mid-points for only the pain, sleep and mobility scales, although this finding is expected given that many people responded at the floor (lowest score); and ceiling effects were negligible, although floor effects exceeded the 15% criterion for the exudate, odour, vitality, and appearance and self-consciousness scales.
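The floor and ceiling criterion used here (no more than 15% of the sample at the minimum or maximum scale score; Table 1) can be sketched as follows. The scores are made up for illustration and the function name is hypothetical; a 0-100 score range is assumed.

```python
def floor_ceiling(scores, min_score=0.0, max_score=100.0, criterion=15.0):
    """Percentage of respondents at the scale minimum and maximum,
    flagged against the 15% criterion."""
    n = len(scores)
    floor_pct = 100.0 * sum(1 for s in scores if s == min_score) / n
    ceiling_pct = 100.0 * sum(1 for s in scores if s == max_score) / n
    return {
        "floor_pct": floor_pct,
        "ceiling_pct": ceiling_pct,
        "floor_exceeds": floor_pct > criterion,
        "ceiling_exceeds": ceiling_pct > criterion,
    }


# Hypothetical scale scores for ten respondents.
example = [0, 0, 50, 100, 25, 75, 10, 20, 30, 40]
print(floor_ceiling(example))
```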
Reliability was high as demonstrated by Cronbach's alpha values for all PU-QOL scales exceeding the standard criterion of 0.7 (Table 5). Item-total correlations ranged 0.525-0.920, fulfilling the recommended criteria (>0.3). Test-retest correlations for 8/10 scales exceeded 0.7; two scales had correlations below the recommended criteria, but marginally (Table 5), thus mostly fulfilling the recommended minimum criteria and indicating good scale stability.
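As a reminder of what the alpha criterion measures, the sketch below computes Cronbach's alpha from a respondents-by-items matrix using the usual variance formula. The data are invented, and the study itself used SPSS for field test one rather than code of this kind.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows: list of respondents, each a list of item scores for one scale.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    item_columns = list(zip(*rows))
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)


# Hypothetical responses on a three-item scale scored 0-2.
data = [[0, 0, 1], [1, 1, 1], [2, 2, 1], [1, 2, 2], [0, 1, 0], [2, 2, 2]]
alpha = cronbach_alpha(data)
print(round(alpha, 2), alpha >= 0.70)
```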
Evidence of internal construct validity was supported by moderate to high item-total correlations; high Cronbach's coefficient alphas; and moderate to high inter-item correlations (means >0.48; ranges 0.226-0.934; Table 5), indicating that each PU-QOL scale measures a single construct. Hypothesised correlations between PU-QOL and related SF-12 scales were consistent with predictions (Table 5), thus providing support that scales measure what they intend to measure; moderate to high correlations (r >0.30) were predicted. Correlations between PU-QOL scales and sociodemographic variables (age, gender) were consistent with predictions (r <0.30; Table 5), suggesting responses to scales are not biased by age or gender. Hypothesised group differences were as predicted for the following scales: exudate, odour, vitality, daily activities, emotional well-being, and self-consciousness, with significant step increases in mean scores observed by PU severity groups. In contrast, there was no step increase in mean scores for the pain, sleep, mobility and movement, and participation scales. Apart from the sleep scale, the mean score on outcomes for category 1 PU severity was lower than for category 3/4 severity, suggesting that HRQL outcomes are worse for people with severe PUs compared to those with superficial category 1 PUs. It is important to note that category 1 PUs had small samples (range 4-14 patients); therefore, known groups results are considered preliminary.

Final PU-QOL Instrument
The final PU-QOL is a self-report instrument comprising 10 scales. These include three symptom scales (pain (8 items), exudate (8 items), odour (6 items)), plus an itchiness item; four physical functioning scales (sleep (6 items), movement and mobility (9 items), daily activities (8 items), vitality (5 items)); two psychological well-being scales (emotional well-being (15 items), self-consciousness and appearance (7 items)); and one social participation scale (9 items). It is intended for administration where patients rate the amount of "bother" attributed (e.g. "During the past week, how much have you been bothered by…?") on a 3-point response scale (e.g. 0 = not at all to 2 = a lot). Scale scores are generated by summing items and then transforming to a 0-100 scale. High scores indicate greater patient bother.
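The scoring rule can be illustrated with a short sketch. It assumes a simple linear transformation of the summed score onto 0-100 for items scored 0 to 2; the exact transformation and the handling of missing items are set out in the instrument's user manual rather than in this text, so both are assumptions here and the function name is hypothetical.

```python
def scale_score(item_responses, max_per_item=2):
    """Sum item responses (each 0..max_per_item) and rescale linearly to 0-100.

    Assumption: complete data only; missing-item rules are not specified here.
    """
    if any(r is None for r in item_responses):
        raise ValueError("missing item responses are not handled in this sketch")
    raw = sum(item_responses)
    return 100.0 * raw / (max_per_item * len(item_responses))


# Hypothetical responses to an eight-item scale (e.g. the length of the pain scale).
print(scale_score([2, 1, 0, 1, 2, 2, 0, 0]))  # 50.0
```

Under this transformation a score of 0 means no bother on any item and 100 means 'a lot' of bother on every item.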

Discussion
The PU field requires a strong evidence-base that incorporates health outcome measurement from the patient perspective. To fully capture and quantify the patients' viewpoint, appropriately constructed and validated instruments are required. The PU-QOL instrument consists of 10 scales for measuring symptoms and physical, psychological, and social functioning specific to PUs. This is the first outcome measure reflecting PU-specific conceptual HRQL domains; its content differs from other chronic wound-specific instruments [12], and it provides a framework for designing future research that consequently improves the quality of research in the field by inclusion of PU-specific PROs. Scale development and item reduction were primarily guided by RMT. RMT provides a powerful framework to guide scale construction by detecting items deviating from model expectations with the intention of improving scale attributes. Evidence from RMT was used to understand why some scale items were not working and to pinpoint where improvements could be made. However, final decisions on item inclusion were made according to appraisals of the analyses of the observed data against measurement criteria and clinical relevance, as opposed to examinations carried out singularly or sequentially.
The final psychometric evaluation demonstrated that PU-QOL scales mostly satisfy criteria for acceptability, reliability and validity, in line with recommended FDA guidelines for measurement [9]. However, the Rasch analysis detected targeting problems despite attempts to sample a wide variety of patients with PUs drawn across settings. Targeting is justified for the exudate and odour scales as not all patients have these problems; it is clinically reasonable that these people fall outside the scale range. Importantly, where people have symptom bother, there needs to be items within the scales that discriminate symptom bother, and in this instance, the symptom scales perform this function. For the remaining scales, targeting could be improved by developing items that span a wider measurement range, and in the process, maximise the potential of the PU-QOL to detect change. Extending the measurement range can be achieved without affecting the scales as they stand, because the item locations are calibrated relative to each other. Important to note, scale scores for >65% of the samples were within the best performing part of all scales. For example, the pain scale items spread 2-logits compared to a person spread of 7 logits, indicating suboptimal targeting. But for the majority of people in the sample, the measurement range distribution was within the range where most people lay, indicating good pain scale performance.
Given the heterogeneity of the population with PUs, further work is required to ensure that the PU-QOL scales fit the needs of all people with PUs including patients with superficial PUs. Appropriateness of PU-QOL's use in individual decision-making needs investigation; strengthening the measurement precision could improve the PU-QOLs ability to detect differences in HRQL outcomes between people with different PU severity. This is important for making inferences from future research using the PU-QOL. However, one consideration is that during field testing, as is standard practice, patients received some form of treatment for their PU; information that was not collected (e.g. amount of analgesia). Therefore, the true impact of PUs may not have been captured (lower severity represented in the sample due to treatment effect) and be the reason for, at least in part, mistargeting and misrepresentation of known groups testing. In actual fact, PUs appear to cause patients more bother (as indicated from the qualitative work) than was represented but good care received lowered PU impact in the sample. This is a methodological issue in this area. Finally, the three-category scoring function did not work as intended for some scale items and requires exploration. The above limitations do not preclude use of the PU-QOL instrument. PU-QOL scales can be included as one outcome measure, amongst others, for group comparisons in future PU research (e.g. clinical trials) on the proviso that studies have built in a parallel psychometric analysis to indicate the performance (psychometric evaluation) of the scales in future samples.
The final Rasch analyses provide an initial evidence base for future testing to improve the PU-QOL scales and to establish the extent to which psychometrically sound scales have been developed. Future scale developments can be empirically driven; the distribution of item locations highlights where 'gaps' in the measurement continuum lie. Notable distances between item locations can be filled with new items, particularly items representing superficial PU impact, and the measurement range can be extended at the extreme ends of the continuum. The process of modifying a newly developed instrument is part of an evolving, ongoing measurement process intended to strengthen the hypothesised conceptual relationships with empirical evidence [49]. The usefulness of new measures is therefore demonstrated by multiple applications in different studies (a cumulative body of evidence to support scale measurement properties). Future research will investigate the sensitivity of PU-QOL scales to change and their responsiveness, and will develop an instrument to enable economic evaluation. Development of proxy measures and language translations is needed given the high prevalence of cognitively impaired patients with PUs.
The PU-QOL instrument is intended for administration, following a user manual, with adults across the range of PU severity and type (location and duration) in UK acute and community healthcare settings. Scales can be selected depending on the nature of the research, and scale items are summed to produce scores. The PU-QOL can be used for: effectiveness intervention research where improvement and/or deterioration in HRQL is measured; promoting patient-clinician communication (i.e. flagging issues); informing changes to treatment; facilitating priority setting, patient care and PU management decisions; and assessing the care given from the patient's perspective. Currently, the PU-QOL is most appropriate for people with severe PUs, as demonstrated by a lack of items to represent people with little or no bother due to PUs. The exudate and odour scales are not intended for people with superficial category 1 PUs. Electronically defined 'skip' questions would assist in selecting scales and items relevant to each individual's circumstances.
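The summed scoring described above can be sketched as follows. This is an illustrative sketch only, not the scoring algorithm from the user manual: the 0/1/2 coding of the three response categories, the function name and the rule of returning no score when any item is missing are all assumptions made for the example.

```python
from typing import Optional, Sequence


def scale_score(
    responses: Sequence[Optional[int]], n_categories: int = 3
) -> Optional[int]:
    """Sum item responses for one scale; lower scores indicate better outcome.

    Assumptions (not taken from the user manual): categories are coded
    0..n_categories-1, and a scale with any missing item gets no score.
    """
    if any(r is None for r in responses):
        return None  # assumed missing-data rule, for illustration only
    if any(r not in range(n_categories) for r in responses):
        raise ValueError("response outside the assumed 0..2 coding")
    return sum(responses)


# Hypothetical responses to a four-item scale.
example = scale_score([0, 1, 2, 2])      # 5
incomplete = scale_score([0, None, 1])   # None
```

Because scales can be selected independently, a function of this kind would be applied to each selected scale separately rather than to the instrument as a whole.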
As the PU-QOL was developed and evaluated in the UK, its validity and reliability are characteristics of the instrument for a specific population (i.e. UK nationals) and should therefore be re-evaluated for any new population. A language translation or cross-cultural adaptation may be required to ensure that the PU-QOL is appropriate for cultures, languages and ethnic groups outside the UK (see the PU-QOL instrument website for guidance on language translation and cross-cultural adaptation processes: http://ctru.leeds.ac.uk/Skin).
This research highlights the importance of fully testing instruments before clinicians and researchers apply them. It also highlights the value of item-level analyses, not typically undertaken, which identified problems with the PU-QOL scales that were not detected by standard tests of scale reliability and validity. It further demonstrated that small iterative steps, using mixed methods in an interactive way, rather than the traditional three-stage approach to PRO development (i.e. qualitative work to generate constructs and content, pre-testing and psychometric evaluation), may be beneficial, particularly at the early content and scale format/design stages, for understanding and resolving instrument issues early in the development process. Both qualitative and empirical findings should be used to inform subsequent work and to make improvements to scales. Uniformity of research approaches for PRO development could lead to consistency in health measurement and to the inclusion of mixed methods, as well as the more sophisticated psychometric methods such as RMT, in accepted international guidelines.

Conclusions
This study makes important contributions to the PU and wider health measurement fields. The findings demonstrate that mixed methods, including RMT, were beneficial for developing a new PRO instrument specific to PUs; this methodology can be applied to further development of the PU-QOL as well as to PROs in other health areas. The PU-QOL instrument provides a means for the comprehensive assessment of PU impact and for quantifying the benefits of PU interventions from the patient's perspective, which has thus far been lacking in the area. Scientifically rigorous PRO measurement needs to become more commonplace in the PU field so that the goal of PU management can be to enhance and maintain the HRQL of people with PUs. Subject to further development, the PU-QOL is a tool with which to evaluate whether PU treatments and the healthcare given achieve this; these are outcomes that are ultimately best judged by patients themselves.