Measuring quality of life in Duchenne muscular dystrophy: a systematic review of the content and structural validity of commonly used instruments

Duchenne muscular dystrophy (DMD) is an inherited X-linked neuromuscular disorder. A number of questionnaires are available to assess quality of life in DMD, but there are concerns about their validity. This systematic review aimed to critically appraise the content and structural validity of quality of life instruments for DMD. Five databases (EMBASE, MEDLINE, CINAHL, PsycINFO, and Cochrane Library) were searched, with supplementary searches in Google Scholar. We included articles with evidence on the content and/or structural validity of quality of life instruments in DMD, and/or instrument development. Evidence was evaluated against the Consensus-based Standards for the selection of health Measurement INstruments (COSMIN) criteria. Fifty-five articles featured a questionnaire assessing quality of life in DMD. Forty instruments were extracted and 26 underwent assessment. Forty-one articles contained evidence on content or structural validity (including 37 development papers). Most instruments demonstrated low quality evidence and unsatisfactory or inconsistent validity in DMD, with the majority not featuring direct validation studies in this population. Only KIDSCREEN received an adequate rating for instrument design and a satisfactory result for content validity based on its development, yet, like the majority of PROMs, the measure has not been directly validated for use in DMD. Further research is needed on the validity of quality of life instruments in DMD, including content and structural validity studies in this population.


Introduction
Duchenne muscular dystrophy (DMD) is an X-linked neuromuscular disorder with an estimated incidence of 1 in 3802-6291 live male births [1,2]. The disease causes progressive muscle weakness due to an absence of the dystrophin protein, which functions to help keep muscle cells intact. Diagnostic symptoms and functional impairment are evident from as early as two years old and average life expectancy of people with DMD is approximately 25 years [3], although increasingly people with DMD are surviving into their fourth and even fifth decades [4]. The disease progresses through four recognised clinical stages characterised by increased muscle weakness, impaired ambulation and motor functioning, and cardiovascular and respiratory problems [5]. There is no cure for the disease. Current clinical efforts are focused on slowing disease progression and improving the health-related quality of life (QoL) of people with DMD, and health interventions are necessarily evaluated for their cost effectiveness against this objective.
A number of generic (such as the EQ-5D [6,7]) and condition-specific (such as the MDCHILD [8]) patient reported outcome measures (PROMs) are used to measure QoL in people with DMD. However, concerns have been raised about the validity of existing PROMs to comprehensively assess QoL in DMD [9]. Given that a number of generic and condition-specific questionnaires are available, researchers and clinicians have to make a critical choice about which measure may be most appropriate for assessing QoL in people with DMD. To help inform this decision, evidence-based guidance is needed on the relative validity and psychometric performance of these instruments. There are a number of reviews exploring QoL and associated measures in DMD, with some providing very basic information on their psychometric properties [9,10]. However, no reviews to date have appropriately evaluated the content validity of available measures for assessing QoL in DMD. This is a striking omission; content validity has been defined by the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) group as the most important property of a PROM [11,12,13]. Furthermore, prior reviews on QoL in neuromuscular disorders have either not referred to, or used an outdated version of, COSMIN guidance. In the current review we used up-to-date COSMIN methodology to assess the content and structural validity of QoL PROMs in DMD [11].
Content validity refers to the extent to which the content of a PROM adequately reflects the construct that it is intended to measure [14]. It can be subdivided into the judged 'relevance', 'comprehensiveness', and 'comprehensibility' of a PROM in assessing the construct of interest within a target population and context [13]. 'Relevance' of a PROM refers to whether the items are relevant for the construct, target population, and context of use of interest; the response options and recall period of a PROM should also be appropriate and relevant. 'Comprehensiveness' is used to describe the extent to which all key aspects of the construct of interest are covered in the PROM. Finally, 'comprehensibility' pertains to the understanding of the items and response options by the population of interest [13].
A thorough assessment of a PROM's content validity should include studies presenting information on content validity in the population of interest, but also consider the initial PROM development paper(s) (i.e. literature describing studies on the development of the PROM) and the content of the PROM itself [12,13]. The consideration of development studies is important, because the quality of how the PROM was developed (e.g. was there a clear description of the construct to be measured? were patients involved? etc.) necessarily has an impact on the evaluation of the content validity of a PROM in its subsequent use. Thus, COSMIN recommends that the quality of PROM development is rated and assessed prior to the quality of any content validity studies [13]. Furthermore, content validity should form the first step of the assessment of the validity of a PROM, as it is integral to that PROM's usefulness in doing the job it was designed to do, and influences all other measurement properties [15,16]. For example, a psychometrically responsive and internally consistent instrument is of little use if it is not measuring what it is intended to measure.
COSMIN guidance states that the second most important form of the validity assessment of a PROM is structural validity [15,16]. Structural validity describes the extent that scores derived from a measure adequately reflect the dimensionality of the construct being measured [17]. Quality of life is usually defined, and thus measured, as a multidimensional construct. Therefore, PROMs that feature multiple dimensions of QoL should be assessed to check they accurately represent the multidimensional structure of QoL in the population of interest. If PROMs are designed to target a single dimension of QoL, assessments should be undertaken to empirically demonstrate their unidimensional nature in the target population. If such assessments are not undertaken, subsequent interpretation of the data (e.g. through generating dimensional scores) may be inaccurate. For the purposes of this review, we define QoL as a multidimensional construct involving physical (e.g., pain, fatigue), psychological (e.g., mood, self-efficacy), and social (e.g., participation, stigma) components, based on the Comprehensive Model of QoL in Muscular Dystrophy (CMQM) [9], and use this to define the construct of interest. We choose to define QoL as a subjective construct and do not include purely functional performance or assessment scales that may impact on QoL. In this review, we consider multi-item PROMs that assess at least one aspect of QoL in people with DMD.
When evaluating a PROM, content and structural validity can be meaningfully assessed against up-to-date published standards by the COSMIN group, derived from international expert consensus [12,14]. These ratings incorporate actual evidence on PROM validity and the quality of that evidence. For example, regarding a PROM's 'comprehensiveness', a positive rating can be given based on a content validity study if: (i) the study quality was not rated as inadequate; (ii) patients or professionals were interviewed; and (iii) no key concepts were missing. For structural validity, a positive rating is given if good model fit is observed in CFA or in IRT/Rasch (see Methods), and can be appraised alongside a rating for the study's quality. The full COSMIN standards and methodology for assessing PROMs are comprehensive and available in accompanying guidance manuals [13,16], which were adhered to when conducting the current review.
This systematic review has been designed to evaluate the content and structural validity of QoL measures used in people with DMD using updated COSMIN guidance [13,16], to provide researchers and clinicians with a robust evidence-base to help them when selecting PROMs to measure QoL in the Duchenne population. The review makes a unique contribution to the literature in being the first to assess the content validity of PROMs used in DMD and to apply an up-to-date and thorough COSMIN assessment of these measures. There are two main questions being addressed: 1) Which PROMs have been used to assess QoL in published research with boys and men diagnosed with DMD? 2) What is the content and structural validity of these PROMs for use in assessing QoL in boys and men with DMD?

Methods
The review protocol was registered on PROSPERO [18]. This systematic review has been reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist [19].

Search strategy and selection criteria
This review contains two searches. The first search (Search A) was designed to identify PROMs used to measure QoL in DMD in peer-reviewed publications. The second search (Search B) was used to identify literature reporting on the measurement properties of these PROMs in DMD. Search B also included the recommended practice of searching for the development papers of PROMs to enable a full COSMIN assessment of their content validity [12,13,15]. Full copies of the searches are contained in Additional file 1, for reproducibility.

Search A and selection criteria
Search A was conducted on 11th April 2018, searching EMBASE, MEDLINE, CINAHL, PsycINFO, and the Cochrane Library, from inception. No restrictions on date or language were applied to the search. Search A terms are illustrated in full in Additional file 1 and included: (I) Duchenne muscular dystrophy (and Duchenne*) AND ((II) a search filter provided by the PROM group at the University of Oxford to identify PROMs (available online [20] and in Additional file 1) OR (III) PROMs known to be used in people with DMD based on a prior rapid review of the literature [21]). The following selection criteria were applied to the results of Search A by two independent reviewers: (I) published in English as a full-text original research article (i.e. not including abstracts, editorials, or reviews); (II) used a self-reported, multi-item PROM to assess at least one aspect of QoL in males diagnosed with DMD (assisted or proxy-reported versions of PROMs were considered for inclusion so long as a self-report version of that PROM exists); and (III) in the case of studies involving mixed clinical samples, at least 75% of the sample (or subgroup) on which data from the PROM were reported were males diagnosed with DMD. The inclusion criteria were first applied to titles and abstracts of the hits from Search A. Records were selected for full-text review if they matched the selection criteria, potentially matched the criteria, or if doubt existed. Any discrepancy was resolved by a third reviewer. Full-text articles were then screened against the selection criteria by two independent reviewers. Any disagreements were resolved by a third reviewer through discussion. Finally, the PROMs identified in the articles were themselves reviewed by two independent reviewers to ensure they met the requisite inclusion criteria (i.e. assessing an aspect of QoL).

Search B and selection criteria
Search B was conducted on 18th September 2018, with initial searches on EMBASE, MEDLINE, CINAHL, PsycINFO, and the Cochrane Library, from inception. No restrictions on date or language were applied to the search. Search B terms are illustrated in full in Additional file 1 and included: term (I) from Search A AND ((II) PROMs identified in Search A OR term (III) from Search A) AND (IV) a search filter by the COSMIN group for identifying studies on measurement properties [22] (available online [23] and in Additional file 1). In addition to the selection criteria for Search A, the following criteria were applied to the results of Search B: (IV) described data on the content and/or structural validity of the PROMs identified in Search A in males diagnosed with DMD; (V) included a PROM validated in English, with a free/review copy available to access.
As recommended in the COSMIN approach [15], follow up searches were conducted on Google Scholar to identify key development papers for the PROMs identified in Search A and taken forward for review (see Section 3.2). Google Scholar was searched (last searched 14th November 2018) with the names and acronyms of the PROMs (version numbers omitted) and the first 100 hits were screened for inclusion [15]. Search results were initially screened by title, with any relevant and potentially relevant papers exported to a database. Following the removal of duplicates against the primary searches, records were screened by abstract and then full text against selection criteria. As per COSMIN guidance, development papers for the PROMs were not subject to any of the inclusion criteria noted above and were included in any published form [13,15]. Results of the searches were screened for inclusion by two reviewers.
Finally, citation tracking of all eligible articles identified in Search B was conducted by reviewing references and citations on Google Scholar (last searched 6th February 2019) for any articles not identified in the initial searches that may meet the inclusion criteria. All references and citations were reviewed, except where citations became unmanageable (i.e. > 500 citations), when "Duchenne" was searched for within the citing articles to filter the hits for manageable review. Search results were initially screened by title, with any relevant and potentially relevant papers exported to a database. Following the removal of duplicates against the primary searches, records were screened by abstract and then full text against selection criteria. Results of the searches were screened for inclusion by two reviewers.

Data extraction and COSMIN risk of bias assessment
Data extraction was undertaken by two reviewers using a pre-prepared data extraction sheet, with consensus on any ambiguities reached through discussion. The data extraction sheet was informed by the tools developed by COSMIN on reporting guidance [16], and included study characteristics (authors, year, journal, country, language, study type); details of the PROM used (name, mode of administration, recall period, total N subdomains, subdomain names, total and subdomain N items, total and subdomain response levels, total and subdomain score ranges); DMD sample characteristics if applicable (N, age, percentage ambulatory, total and subdomain PROM score, total and subdomain observed ranges); details of PROM development if applicable (construct definition, target population, original language, intended context of use, patient involvement); details of content validity results if applicable (summarised results, e.g. findings from a cognitive debriefing exercise); and details of structural validity results if applicable (analytic model, summarised results, e.g. fit statistics, tests of model assumptions for IRT/Rasch).
The methodological quality of the PROM development papers and of the studies on content and structural validity was assessed (at the study level) using up-to-date COSMIN standards via the new COSMIN risk of bias checklist [24]. A total rating for relevance, comprehensiveness, and comprehensibility (content validity aspects) of a PROM is determined separately, alongside a total rating for the methodological quality of a structural validity study [24]. When rating the methodological quality of the studies, each COSMIN standard (or item) is rated on a 4-point scale: "very good", "adequate", "doubtful", and "inadequate". Total ratings are determined using the lowest rating for any item for that study (i.e. worst score counts) [25]. Studies were initially rated independently by two reviewers and, in the case of divergence, consensus was reached in a subsequent face-to-face meeting. This information on risk of bias is used to inform quality of evidence (see section 2.2).
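The "worst score counts" rule described above can be illustrated with a minimal sketch. Only the four rating labels and the lowest-rating rule come from the text; the function name and the example item ratings are hypothetical and are not taken from the COSMIN checklist itself.

```python
# Rating labels from the text, ordered from best to worst.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def overall_study_rating(item_ratings):
    """Return the lowest (worst) rating among a study's item ratings."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # A higher index in RATING_ORDER means a worse rating, so the
    # maximum index identifies the rating that determines the total.
    return max(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for a single study (illustration only).
example_items = ["very good", "adequate", "doubtful", "very good"]
print(overall_study_rating(example_items))
```

Under this rule a single "doubtful" item is enough to cap the total rating of an otherwise well-reported study at "doubtful".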

Assessment of measurement properties
In order to synthesise and assess evidence on content validity, two reviewers independently rated the results of PROM development studies, content validity studies, and the content of the PROM itself on 10 COSMIN criteria [13], agreed upon by international consensus [12]. These criteria included: whether the included items were relevant for (I) the construct of interest, (II) the population of interest, and (III) the context of use of interest; whether the (IV) response options and (V) recall period were appropriate; whether (VI) all key concepts were included; whether (VII) the PROM instructions and (VIII) PROM items and response options were understood by the population of interest as intended; whether (IX) the PROM items were appropriately worded; and whether (X) the response options matched the question. Ratings for each source of evidence were made separately, using COSMIN guidance [13] (p.54) and could either be positive (+), negative (−), or indeterminate (?). Reviewers' ratings were made based on the judgement of the researchers, who have experience in PROM design and work with people with DMD, including direct qualitative research [26,27]. When reviewers considered whether the items were relevant or comprehensive for the construct of interest, they were compared against the CMQM [9]. Accordingly, a PROM would be sufficiently comprehensive (+) if it included items covering physical, psychological, and social aspects of QoL. When judging the appropriateness of the recall period, reviewers considered any defined recall period of up to 4 weeks as appropriate (+), as children aged 8 years and above can recall up to this length of time with sufficient accuracy [28]. When rating the appropriateness of response options, bearing in mind the target sample (i.e. a child or adult PROM), reviewers took into account the numerical range, how the response options were visually displayed, and the perceived cognitive complexity of the options (including wordiness, degree of variation throughout the questionnaire, and the use of reversed ordering).
Following the above assessment, an overall (qualitatively synthesised) judgment on the relevance, comprehensiveness, and comprehensibility of each PROM was made, which could be sufficient (+), insufficient (−), or inconsistent (±), using COSMIN guidance [13] (p.58). For example, if all sources of evidence were rated positive (+) for relevance, then the overall rating for the PROM would be sufficient (+). As recommended by COSMIN [13], more weight was given to content validity studies, then development studies, then ratings of the PROM by reviewers. Ratings were compared and combined across the two reviewers by consensus. As per COSMIN guidance [13], only available evidence was taken into account when assessing content validity, so, for example, if there were no content validity studies in DMD available for that PROM, assessment was made based on the ratings of any PROM development studies and the ratings of reviewers. The fact that the PROM had no content validation studies in DMD is then reflected in a lower quality of evidence rating (see below). An example content validity rating spreadsheet for the KIDSCREEN-52, including the rules for synthesising the individual ratings is included in Additional file 2.
Evidence on structural validity was assessed against the updated COSMIN criteria for good measurement properties, using the same rating scale as above [16]. Specifically, a positive (+) rating would be given for a CFA featuring a CFI, TLI or comparable measure > 0.95 OR RMSEA < 0.06 OR SRMR < 0.08. For an IRT/Rasch model, a positive (+) rating would be given for no violation of unidimensionality (e.g. assessed with the fit statistics above) AND no violation of local independence (e.g. residual correlations among items after controlling for the dominant factor < 0.20) AND no violation of monotonicity (e.g. evidenced graphically or item scalability > 0.30) AND adequate model fit (e.g. χ² > 0.01, infit/outfit mean squares ≥ 0.5 and ≤ 1.5 OR Z-standardized values > −2 and < 2). A negative (−) rating would be given if these criteria were not met in the data and an indeterminate (?) rating would be given if model fit was not reported.
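As a rough illustration of the CFA branch of these criteria only, the following sketch maps reported fit indices to a rating. The function name, its signature, and the example values are hypothetical; the thresholds are those quoted above for the comparative fit index (CFI), Tucker-Lewis index (TLI), root mean square error of approximation (RMSEA), and standardised root mean square residual (SRMR). The IRT/Rasch criteria are not covered.

```python
from typing import Optional


def rate_cfa_fit(cfi: Optional[float] = None,
                 tli: Optional[float] = None,
                 rmsea: Optional[float] = None,
                 srmr: Optional[float] = None) -> str:
    """Rate CFA evidence as '+', '-' or '?' using the thresholds in the text."""
    reported = [v for v in (cfi, tli, rmsea, srmr) if v is not None]
    if not reported:
        return "?"  # model fit not reported: indeterminate
    meets_criteria = (
        (cfi is not None and cfi > 0.95)
        or (tli is not None and tli > 0.95)
        or (rmsea is not None and rmsea < 0.06)
        or (srmr is not None and srmr < 0.08)
    )
    return "+" if meets_criteria else "-"


# Hypothetical fit statistics (not taken from any study in this review).
print(rate_cfa_fit(cfi=0.97, rmsea=0.05))  # at least one criterion met
print(rate_cfa_fit(cfi=0.90, rmsea=0.09))  # no criterion met
print(rate_cfa_fit())                      # nothing reported
```

Because the criteria are joined by OR, a single index meeting its threshold is sufficient for a positive rating in this sketch.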
Finally, the quality of the evidence was graded using a modified GRADE approach [29], as either "high", "moderate", "low", or "very low". The GRADE approach takes into account the risk of bias of studies (or study quality); (in)consistency across studies; imprecision (based on sample sizes); and indirectness (of evidence) [16]. The quality of evidence is assumed to be high and is then downgraded by 1-3 points based on the degree of risk of bias (including quality and absence of content validity studies), 1-2 points based on inconsistency, and 1-2 points based on indirectness. Further details on how to apply all of the above criteria are provided elsewhere in comprehensive manuals, which were followed when conducting this review [13,16]. The quality of this systematic review itself was appraised against a recently developed COSMIN checklist to assess the quality of systematic reviews of health-related PROMs [22].
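The downgrading logic can be sketched as follows. This is a simplified illustration rather than the COSMIN procedure itself: it assumes that each point lowers the grade by one level and that the grade cannot fall below "very low". Imprecision is omitted because no point range is given for it above. The function and parameter names are hypothetical.

```python
LEVELS = ["high", "moderate", "low", "very low"]


def grade_quality(risk_of_bias: int = 0,
                  inconsistency: int = 0,
                  indirectness: int = 0) -> str:
    """Start at 'high' and downgrade one level per point (assumed mapping)."""
    if not 0 <= risk_of_bias <= 3:
        raise ValueError("risk_of_bias must be between 0 and 3 points")
    if not 0 <= inconsistency <= 2:
        raise ValueError("inconsistency must be between 0 and 2 points")
    if not 0 <= indirectness <= 2:
        raise ValueError("indirectness must be between 0 and 2 points")
    total_points = risk_of_bias + inconsistency + indirectness
    # The grade cannot fall below the last level, 'very low'.
    return LEVELS[min(total_points, len(LEVELS) - 1)]


# Hypothetical examples of downgrading.
print(grade_quality())                                # no downgrading
print(grade_quality(risk_of_bias=2))                  # two points for risk of bias
print(grade_quality(risk_of_bias=1, indirectness=1))  # one point each
```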

Results of Search A - PROMs used to measure quality of life in DMD
After removing duplicates, 1733 records were identified through database searching for Search A. Of these, 1521 were excluded at the title/abstract review stage, leaving 212 papers for full-text review. Of these 212 papers, 84 were excluded as they were not full-text published research articles; 25 did not meet the required sample criteria of at least 75% of the sample being boys or men with DMD; 21 were judged not to be assessing QoL; 16 were not published in English; and finally 11 papers did not feature a multi-item PROM. Five articles were additionally excluded during the review of the actual PROM used in the manuscript for not assessing QoL. Accordingly, a total of 50 records from the initial searches met the selection criteria for Search A. A further 5 articles that met the selection criteria for Search A were added as a result of citation tracking, giving a total of 55 records. Table 1 summarises the PROMs used to assess QoL in DMD from the full-texts meeting the selection criteria at Stage 1 (n = 55). A total of 40 PROMs used to assess at least one aspect of QoL in DMD were identified in published research articles through database searching (the two HUI classification systems use the same 15-item PROM). The majority of the PROMs were multidimensional (n = 32), designed to assess a range of different facets of QoL. The remaining unidimensional scales were designed to assess: activity limitations (CALI); anxiety (GAD-7); depression (BDI, DIKJ, PHQ-9); fatigue severity (FSS); life satisfaction (SWLS); or quality of life/health-related quality of life unidimensionally (KIDSCREEN, SOLE). Twenty-four of the PROMs had versions designed for completion by adult or young adult respondents, and 26 had versions designed for children. The most popular PROMs used in published research articles assessing QoL in people with DMD were the PedsQL 4.0 GCS (18 articles); PedsQL 3.0 NMM (10 articles); and the SF-36 (8 articles).
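The record counts reported for Search A can be checked with simple arithmetic. The sketch below uses only the numbers given in the paragraph above; the variable names and shortened exclusion labels are illustrative.

```python
# Counts reported for Search A in the text.
records_after_duplicates = 1733
excluded_title_abstract = 1521
full_text_reviewed = records_after_duplicates - excluded_title_abstract

full_text_exclusions = {
    "not a full-text published research article": 84,
    "sample criterion (at least 75% DMD) not met": 25,
    "not assessing QoL": 21,
    "not published in English": 16,
    "no multi-item PROM": 11,
}
excluded_on_prom_review = 5      # PROM itself judged not to assess QoL
added_by_citation_tracking = 5

after_full_text = full_text_reviewed - sum(full_text_exclusions.values())
from_database_search = after_full_text - excluded_on_prom_review
total_included = from_database_search + added_by_citation_tracking

# Reproduces the 212 full-text papers, 50 records from the initial
# searches, and 55 records in total reported in the text.
print(full_text_reviewed, from_database_search, total_included)
```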

Results of Search B - evidence on measurement properties of PROMs
After removing duplicates, 92 records were identified through database searching for Search B. Of these, 51 had already been excluded during Search A. Eighteen unique records were found, of which 14 were excluded at the title/abstract review stage, leaving 4 papers for full-text review. Of these 4 papers, 3 were excluded because they were not full-text primary research articles; and 1 did not meet the sample selection criteria. Of the remaining 23 records that had already met the selection criteria for Search A, 11 were excluded at full-text for containing information on measurement properties other than, and not including, content and structural validity; 4 were excluded as not containing information on measurement properties; and 3 for including a PROM that did not have a validated English copy that was free and/or available for review. The remaining 5 papers that met the selection criteria for Search B featured evidence on content validity (n = 3, of which one was classified as a development paper) and structural validity (n = 2). Finally, 33 PROM development papers were identified through a review of Google Scholar search results and 3 PROM development papers were identified through citation tracking, resulting in a final selection of 41 papers that met the selection criteria for Search B (see Fig. 1). These included 37 development papers, 2 content validity studies in DMD, and 2 structural validity studies in DMD.
The observed proportionate agreement between reviewers during selection, based on the primary database searches, was 92.4% at title/abstract review, with Cohen's κ = 0.51 ("moderate agreement"), which is similar to other published reviews [84,85]. At full-text review, the observed proportionate agreement was 93.5%, with Cohen's κ = 0.82 ("almost perfect agreement").
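To show how a high proportionate agreement can coexist with a more modest κ, the sketch below computes both from a table of screening decisions using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The counts are hypothetical and are not the screening data from this review.

```python
def cohens_kappa(table):
    """Observed agreement and Cohen's kappa from a square table of counts.

    table[i][j] is the number of records placed in category i by
    reviewer 1 and in category j by reviewer 2.
    """
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("table must contain at least one record")
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement from the product of the reviewers' marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical include/exclude decisions for 100 records, where most
# records are excluded by both reviewers.
hypothetical_table = [[5, 4],
                      [4, 87]]
p_o, kappa = cohens_kappa(hypothetical_table)
print(f"agreement = {p_o:.2f}, kappa = {kappa:.2f}")
```

In this hypothetical table the reviewers agree on 92% of records, yet κ is only about 0.51, because most of that agreement would be expected by chance when nearly all records are excluded.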
Following the searches, 26 PROMs were taken forward for COSMIN quality assessment on content and structural validity in DMD (Table 2). The remaining 14 PROMs were not assessed for the following reasons: a copy of the PROM itself and/or necessary development papers were not freely accessible for review (CAPE, CHQ-PF50, DISABKIDS Smileys, OSIQ, SF-36 v2); no formally validated English copy of the PROM was available or in use (AUQEI, DIKJ, DUC-25, SOLE, TAAQoL, TACQoL); the PROM was no longer available or recommended for use (BASC 1st edition, which has been superseded by the BASC 2); or it was unclear from the study which of a large number of possible variants of a PROM was used (pediatric Neuro-QoL, Neuro-QoL). Table 2 summarises key characteristics and COSMIN quality assessment of the development of the PROMs included in the review. Five PROMs were developed specifically for use within neuromuscular disorders (INQoL, PedsQL 3.0 NMM) or DMD (LSIA, MDCHILD, PedsQL 3.0 DMD module). Eleven PROMs either had no patients involved in their development, or it was unclear if patients were involved.

Content validity - appraisal of PROM development studies
The joint most common COSMIN quality rating assigned to the PROMs for concept elicitation was inadequate (n = 12). This was primarily due to: the PROM development study not being performed in a sample of patients representing the target population (BDI, EQ-5D-3L, GAD-7, HADS, HUI 15Q, PedsQL 3.0 MFS, PHQ-9, SDQ, SF-36, and SWLS); or inadequacies within the details of the qualitative methods used (FSS, INQoL). The concept elicitation study of 11 further PROMs was rated as doubtful due to at least some unclear details/suspected problems within the qualitative methods used.
Two content validity studies in DMD were identified (Table 3). Neither of these studies was conducted in an English language context; instead, both were cross-cultural validation studies. Hu et al. (2013) [67] assessed the relevance, comprehensiveness, and comprehensibility of the PedsQL 3.0 NMM in Chinese children with DMD. Simon et al. (2017) [56] assessed comprehensibility of the LSIA in Brazilian children with DMD, and comprehensiveness in professionals. However, both of these studies received ratings of doubtful due to at least some unclear details/suspected problems within the qualitative methods used.

Content validity evidence synthesis
The evidence from the PROM development papers and content validity studies was combined with reviewer ratings of the PROMs to produce a synthesis of the available evidence using the 10 COSMIN criteria for good content validity [13]. For most PROMs, the quality of the evidence was downgraded from High to Low or Very Low because the assessment was based on development studies of doubtful or inadequate quality, respectively [13]. Only the LSIA and the PedsQL 3.0 NMM had moderate supporting evidence, featuring independent content validity studies as well as development papers. The KIDSCREEN measures and the LSIA were the only PROMs to receive satisfactory results for all three dimensions of content validity (relevance, comprehensiveness, and comprehensibility), based on the evidence available. Full synthesised results are presented in Table 4.

Structural validity - appraisal of structural validity studies
Two studies had assessed the structural validity of the PROMs included in this review in samples of people with DMD (Table 5). Both of these were conducted using English versions of the PROMs, in either the UK or the USA. [...] DMD. This study received a very good COSMIN quality rating for its methodological content.

Structural validity evidence synthesis
Of the 2 studies that assessed the structural validity of the PedsQL 4.0 GCS and PedsQL 3.0 NMM in people with DMD, neither provided satisfactory results (Table 4). [...] [66] meant that the quality of evidence supporting this conclusion was graded as high.

Quality assurance of the review
The quality of this review was self-assessed against a newly derived COSMIN checklist [22], designed to evaluate the quality of systematic reviews of health-related PROMs. The results are displayed in Additional file 3.
In general, the review meets numerous quality indicators as defined by the COSMIN team, including the elements included in the research aim, search strategies, article selection, and assessment of measurement properties and quality. In a couple of instances, criteria have been partly met. First, all instruments were included in this review where a validated English copy was freely available for review. It is possible that additional instruments could have been included if licenses had been paid for to access the relevant PROMs and development materials, and this could be considered a limitation. Second, citation tracking (i.e. reference checking) was conducted on the final set of articles eligible at Stage 2 [...]

Discussion
In this systematic review, the published scientific evidence on the content and structural validity of PROMs used to measure at least one aspect of QoL in people with DMD was thoroughly evaluated. The overriding theme was one of sparse evidence. Many PROMs that are being used to assess aspects of QoL in people with DMD are being utilised without the accompanying good quality evidence that supports their validity for this task.
Only five of the PROMs uncovered in this review were specifically designed for use in people with neuromuscular problems (three for DMD), and only two of these have had their content and/or structural validity independently assessed in this population (with the content validity studies involving translated versions). When the evidence is available, most of it is either of low quality or features insufficient detail in the published articles to allow the thorough and comprehensive assessments of content and structural validity demanded by COSMIN [16], leading to doubtful ratings. Indeed, one of the highest quality pieces of evidence reviewed in terms of reported methodology, Landfeldt et al. (2018) [66], reported insufficient structural validity of the PedsQL 3.0 Neuromuscular module (NMM) in DMD.
The results of the review are perhaps unsurprising. Many of the PROMs identified are what could be described as "legacy" measures. They were developed at a time when the science of construct and item generation was largely overlooked. The content of instruments was largely defined by clinical or expert opinion, with little explanation of what that entailed, and the reporting of such stages in publications or questionnaire manuals was not commonplace. Transparent reporting of the early stages of PROM development has only gained traction in the last decade or so. Whilst this is a positive step for researchers, clinicians and users alike, progress can be limited by journal restrictions on word count and remit. It is, however, possible for such legacy measures to be appropriately validated (or have their validity assessed) in properly designed studies assessing content or structural validity in modern samples of people with DMD. The problem observed in this review is that researchers are likely using such measures as a consequence of precedent or tradition, rather than on the basis of supportive evidence. Another related legacy issue within PROM development, which this review touches upon and which has changed for the better over time, is recognition of the importance of direct patient involvement in developing PROMs [119,120]. In this review, almost half (11 out of 26) of the PROMs did not demonstrate any evidence of patient involvement in their development. While most of these PROMs are legacy measures, this is a noteworthy figure, given that patient involvement is the only way to ensure a PROM is capturing health and QoL outcomes in a way that is relevant, comprehensive, and comprehensible to the patient population [119]. The use of patient involvement in PROM development is thus advantageous for researchers and patients alike.
To guide PROM developers, a framework has recently been published to help researchers fully incorporate patient and public involvement (PPI) in the development of PROMs [120].
In the current review, some PROMs performed better than others under COSMIN assessment. First, the KIDSCREEN instrument (all versions) does show some evidence of applicability given that it covers many aspects of QoL. The PROM development study for the KIDSCREEN instrument was the only one rated as adequate. The instrument was designed to assess QoL in children and adolescents with chronic illnesses, and the ratings for the content validity of the measure were positive (based on the available evidence in the measure's development). However, it must also be borne in mind that there is little or no direct evidence to support its content or structural validity in DMD specifically. The original KIDSCREEN instrument (52-item version) was designed to assess multiple aspects of QoL, namely: physical well-being; psychological well-being; moods and emotions; self-perception; autonomy; parent relation and home life; financial resources; social support and peers; school environment; and social acceptance (bullying), covering much of the CMQM framework [9]. The conceptual framework of the instrument is thus intuitively applicable to the Duchenne community; however, the measurement of impact may be limited by the target age range of the PROM itself (8-18 years). While age-specific measurement is not uncommon (i.e. QoL is often measured differently in childhood and adulthood), the restricted age range raises some question about the instrument's applicability to the broader DMD population.
The second-best performing PROM in this review was the LSIA, which received a satisfactory score for relevance, comprehensiveness, and comprehensibility in terms of content validity, based on the information available and reviewers' ratings of the PROM itself. However, the development study for this measure lacked key details necessary in good PROM development, and thus was rated as doubtful. Furthermore, while the LSIA was one of few measures to feature a content validity study, it was a cross-cultural adaptation study of a Brazilian version of the measure, and the results of the formal assessment of this study were doubtful. While the measure is comprehensive, it only comes in a 45-item version, which is potentially quite burdensome. In addition, the measure is designed for use in children and young adults only, and may not generalise to adults with DMD.
The most recent PROM developed specifically for use in children and adolescents with DMD was the MDCHILD. Although the PROM is designed to measure "health-related priorities" [8], much of the content maps onto the CMQM framework [9] and thus covers QoL. While the MDCHILD had many commendable strengths in PROM design, the overall rating of the PROM development, based on the COSMIN worst score counts system [25], was doubtful due to a lack of detail reported in the development papers. For example, it was unclear whether skilled interviewer(s) were used; to what degree data were coded independently; and to what degree, if at all, at least two researchers were involved in the data analysis. This led to a low quality of evidence. Further, because the target population of interest was not clearly defined (i.e. age ranges were not specified), the PROM received an inconsistent rating for relevance despite performing well in other areas. These results speak to the potential harshness of the worst score counts system advocated by COSMIN, which we discuss further below. Finally, because the PROM is new, there is a lack of published content validity studies that may improve the quality of evidence for the MDCHILD going forward, such as that contained in a non-peer-reviewed thesis [100], which was not eligible for inclusion in the current review.
The PedsQL and associated modules were the most commonly used of all the PROMs identified within the review. It should be noted that the development studies of the PedsQL were rated as doubtful. There was little evidence to support the content validity of the neuromuscular module of the PedsQL 3.0 (NMM). Furthermore, the psychometric properties of the NMM were not well supported by Landfeldt et al. (2018) [66]. The inclusion of the PedsQL within clinical practice, cohort studies or pragmatic trials in DMD thus appears to be based upon precedent and common use, rather than published empirical evidence of suitability based on content and structural validity. A notable advantage of the PedsQL (and its derivatives) is the availability of young child (via proxy report), child (self and proxy report), young adult (self-report), and adult forms, which have now been developed. A further consideration is that the PedsQL scales are designed to be used in parallel (e.g. the generic core scales with the NMM or DMD modules), but were assessed individually under COSMIN guidance. Thus comprehensiveness may be improved by using these scales together.
The search identified some PROM instruments that we were unable to obtain. Access to the PROM and/or associated development papers was limited due to licensing requirements, and therefore it was not possible to include these instruments within the review. It is unlikely that these instruments are commonly used within research and/or clinical practice due to the difficulties around access. Their suitability for the DMD population cannot formally be determined; however, their use is likely to be limited by a lack of accessibility derived from license restrictions, reflected in the few citations in which they appeared.
This review adopted guidance developed by the COSMIN initiative, and has adhered to their recommended methods in identification of evidence, data extraction, data assessment and data synthesis. Whilst the appropriateness of these robust methods cannot be questioned, this has resulted in relatively low ratings of the PROMs included within the review. It is important to recognise that this does not suggest categorically that the instruments used within published and/or current studies are not appropriate or fit for purpose; content and structural validity only form one component of PROM suitability within a population. Furthermore, as stated, many of the instruments were developed at a time when instrument development methods and procedures were not reported. That is not to say the development of the instruments is flawed, just that an assessment of them cannot be made. The COSMIN appraisal tools assume a worst score counts system for the rating of the methodological quality of studies [25]. This means that, in theory, a study could be rated as very good or adequate on all but one criterion, on which it is rated as doubtful or inadequate, and the overall score is thus reduced to the latter lower-quality rating. Sometimes this can be because key details, such as whether skilled interviewers were used, are not reported.
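The worst score counts principle described above can be illustrated with a minimal sketch in Python. The function name, the list of ratings, and the example study are hypothetical and serve only to show the logic; they are not taken from the COSMIN materials. The only assumption about COSMIN itself is the ordering of the four rating labels named in the text, from very good down to inadequate.

```python
# Rating labels ordered from best to worst, as named in the text.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def worst_score_counts(item_ratings):
    """Return the lowest (worst) rating among the items rated for one study."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # The rating with the highest position in RATING_ORDER is the worst one.
    return max(item_ratings, key=RATING_ORDER.index)


# Hypothetical study: strong on every item except one unreported detail
# (e.g. whether skilled interviewers were used), which is rated doubtful.
example = ["very good", "very good", "adequate", "doubtful", "very good"]
overall = worst_score_counts(example)
print(overall)
```

In this hypothetical case a single doubtful item determines the overall rating, regardless of how many other items were rated very good, which is the behaviour that can make the system appear harsh when details are simply unreported.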
This review is not without its limitations. While the methodological approach of the review is robust and follows the recommendations of COSMIN and that of other published reviews, it must be acknowledged that the rating criteria applied to the PROMs identified can be viewed as harsh. The COSMIN approach encourages researchers and reviewers to critically appraise evidence of PROM development; however, such evidence is sparse within the published literature. That is not to say that the development phases did not occur, merely that they are not reported and/or not reported in sufficient detail as required by COSMIN assessment. To critique a PROM's applicability using this criterion could be perceived as being unduly critical; more recent PROMs tend to report the early stages of instrument development, and we are assessing all PROMs by modern standards. Similarly, the descriptions of PROMs themselves are often lacking. Basic information such as number of items, recall period, domain structure and scoring procedure was noted to be sporadically reported, although reporting is better in recent literature. The COSMIN-recommended reviewer rating of the identified PROMs for suitability for DMD (as reported in Table 4) has a large subjective component. Whilst this was completed as per the COSMIN guidelines (with two reviewers and discrepancies reconciled following discussion), some of the ratings are at risk of bias based on the team of raters (i.e. QoL researchers). For example, it is not known whether similar ratings of suitability would be achieved if reviewed by an individual with DMD, a family member or carer of a person with DMD, or a clinician, and we recommend that PPI is incorporated in future COSMIN reviews of content validity.
This is further exacerbated when we consider what QoL is. For the purpose of this review it was a multidimensional construct; PROMs that measure a subset of interest (such as depression) may be appropriate to include within studies as part of a host/suite of measures.
The focus of this review was to report on the content and structural validity of PROM instruments that have been used to quantify the impact of DMD on individuals' QoL. However, content and structural validity only address some aspects of PROM suitability, and further work could be undertaken to formally appraise the instruments described. Other measurement properties, such as psychometric performance, could be considered. Given that DMD is a rare condition, the development and validation of PROMs that measure the impact of the condition on QoL is challenging. The number of participants included within various phases of PROM development and validation will be lower than that for a condition such as diabetes, asthma or eczema. Accordingly, the inclusion of subsidiary samples, such as people with other neuromuscular disorders, may be of interest. However, it is not known how appropriate this would be. It can be postulated that other neuromuscular disorders have similar impacts upon QoL; however, this has not been explored within the context of this review.

Conclusions
In conclusion, evidence on the content and structural validity of PROMs assessing QoL in DMD is lacking. Accordingly, our first recommendation from this review is for more research into the content and structural validity of QoL PROMs used in DMD, and, if PROMs are found to be insufficient on these criteria, for additional PROM development within DMD. Second, as a result of this COSMIN assessment, and in the absence of further direct content validation work in DMD, we would provisionally recommend the KIDSCREEN for measuring QoL in children and adolescents with DMD. Nonetheless, we caution that the KIDSCREEN has not been formally validated in samples of people with DMD. Accordingly, more research is needed to definitively support the continued use of KIDSCREEN (and its derivatives) within DMD. Finally, in the absence of further evidence, it is difficult to recommend the routine use of a measure to assess QoL in adults with DMD on content and structural grounds. Instead, the findings of this review support the need for further development of PROMs that are able to accurately assess the impact of DMD on QoL.