Measuring bothersome menopausal symptoms: development and validation of the MenoScores questionnaire

Background The experience of menopausal symptoms is common and an adequate patient-reported outcome measure is crucial in studies where women are treated for these symptoms. The aims of this study were to identify a patient-reported outcome measure for bothersome menopausal symptoms and, in the absence of an adequate tool, to develop a new measure with high content validity, and to validate it using modern psychometric methods. Methods The literature was reviewed for existing questionnaires and checklists for bothersome menopausal symptoms. Relevant items were extracted and subsequently tested in group interviews, single interviews, and pilot tests. A patient-reported outcome measure was drafted and completed by 1504 women. Data was collected and psychometrically validated using item-response theory Rasch Models. Results All questionnaires identified in the literature lacked content validity regarding bothersome menopausal symptoms and none were validated using item-response theory. Our content validation resulted in a draft measurement encompassing 122 items across eight domains. Following psychometrical validation, the final version of our patient-reported outcome measure, named the MenoScores Questionnaire, encompassed 51 items, including one single item, covering 11 scales. Conclusion Menopausal symptoms are multidimensional with some symptoms unquestionably related to the menopausal transition. We identified four constructs of importance: hot flushes, day-and-night sweats, general sweating, and menopausal-specific sleeping problems. The MenoScores Questionnaire is condition-specific with high content validity and adequate psychometrical properties. It is designed to measure bothersome menopausal symptoms and all scales are developed and psychometrically validated using item-response theory Rasch Models. Trial registration Approved by the Danish Data Agency (J.nr. 2015–41-4057). Ethics Committee approval was not required. 
Electronic supplementary material The online version of this article (10.1186/s12955-018-0927-6) contains supplementary material, which is available to authorized users.


Background
Menopause is the cessation of women's menstruation and can be determined retrospectively 12 months after the final menstrual period (FMP) [1,2]. On average, women experience the menopausal transition in their mid-to-late forties [1] and the FMP in their early fifties, with large variations [1,3,4].
Menopausal symptoms differ between cultures and ethnic groups, and also between individuals within a homogeneous population [12,13]. Therefore, measuring self-reported menopausal symptoms presents a challenge, and so does the distinction between menopausal symptoms and the symptoms of aging. Several questionnaires regarding menopausal symptoms exist. However, helping women who are bothered by menopausal symptoms requires a patient-reported outcome measure (PROM) that focuses solely on the bothersome symptoms. Such a PROM must also possess high content validity as well as adequate psychometric properties. Item response theory Rasch models are preferred when establishing ideal psychometric measurement properties such as unidimensionality, invariance (specific objectivity or no differential item functioning), statistical sufficiency and additivity [14,15,16]. The aims of this study are threefold: 1) to review existing questionnaires and symptom checklists (which we also refer to as questionnaires) measuring bothersome menopausal symptoms; and, if no adequate existing questionnaire can be identified from the literature search, 2) to develop a PROM for bothersome menopausal symptoms with high content validity; and 3) to validate this new PROM for dimensionality, invariance, known-groups validity, and reliability using modern psychometric methods.

Methods
The study took place in Denmark and was divided into three phases: 1) a literature review; 2) qualitative interviews securing high face and content validity; 3) a validation survey where the draft PROM was distributed cross-sectionally and the data analyzed using classical test theory (CTT) and item response theory (IRT) models, securing high construct validity of the final PROM.

Phase 1: Literature review
A literature search in PubMed, Embase, and the Cochrane Library was conducted at the end of 2014 and early 2015 to identify existing questionnaires encompassing menopausal symptoms. We also consulted gynaecologists and general practice specialists to locate relevant questionnaires. We included questionnaires that contained at least one item referring to a bothersome menopausal symptom. Questionnaires on the quality of life (i.e. no items referring to specific menopausal symptoms) or concerning interference with or reaction to menopausal symptoms were not included. Questionnaires had to be freely available and written in English, Swedish, Norwegian or Danish. To be considered adequate, a questionnaire had to have high content validity, with items that were up-to-date and neither double-barrelled nor ambiguous. Moreover, its psychometric properties had to have been assessed using IRT.
None of the identified questionnaires fulfilled all the above criteria. Therefore, we extracted an item-pool encompassing unique items about solely bothersome menopausal symptoms from the identified questionnaires. The meaningful content of relevant items was identified and assessed for redundancy, double-barrelled items were divided into separate items, and ambiguous items were rephrased. The items' response options were not transferred [17]. The subject matter of these items was translated into Danish ad-hoc by KSL and JB. The unique items were grouped into domains by KSL based on clinical experience and the literature, and these were subsequently reviewed by JB. Any discrepancies were discussed until we reached consensus.

Phase 2: Qualitative interviews
To test the content validity (content relevance and content coverage) and the understandability of the unique items, two group interviews were conducted with women bothered by menopausal symptoms. The group interviews were audio-recorded, they lasted for two hours, and were moderated by KSL and JB. The first part of the interview was an open-ended discussion about bothersome menopausal symptoms. If new themes (suggested domains) were revealed in the discussion, we generated new items covering these themes using the women's verbatim expressions from the audio recordings (see below). These new items were tested in the following group or in single interviews (see below). In the second part of the group interviews, the women were asked to assess if they found the subject matter of the unique items relevant. Items found irrelevant were deleted from the unique item-pool and, in case of lack of content coverage, new items were generated. We subsequently asked the women to which of the stated themes (suggested domain), their symptoms belonged. A draft PROM was created after the first group interview. At the end of the second group interview, the women were asked to complete the draft PROM. The themes (suggested domains), a recall period, and suggestions for response options were discussed. Instructions were tested for understandability.
Some symptoms postulated to be caused by menopause could also be caused by aging, therefore a global item was developed: "Have you, within the past three months, been bothered by menopausal symptoms?", with four response options: "no, not at all", "yes, a bit", "yes, quite a bit", or "yes, a lot". Later, this global item was used to evaluate the association between women with and without bothersome menopausal symptoms and the scales' ability to discriminate between the four groups in the global item: none, mild, moderate, or severe bothersome menopausal symptoms.
The draft PROM was further tested for functionality, understandability, and content validity in four single interviews conducted by KSL. The women included in these interviews were all bothered by menopausal symptoms. A paper version of the draft PROM was tested in the first two interviews and an online draft version was tested in the two final interviews. If any problems were revealed, they were corrected between interviews.
Finally, the online draft PROM was tested for functionality (including the response option) and understandability in four individual pilot tests, followed by a short interview, among women aged 50-64 where two of the women were bothered by menopausal symptoms.
The group and single interviews were audio-recorded and we measured the time taken to complete the PROM. Notes and important citations were listed during the interviews. After each interview the recording was audited by KSL and used when the key issues and results from the interviews were analyzed.

Phase 3: Validation survey and analysis
The final draft PROM was distributed by a link (SurveyXact) in emails, social media (Facebook groups for women), project research homepage [18], general practices, and the women's lifestyle magazine "Liv" [19] (through their online newsletter and Facebook page). Women aged 45-65 years, with and without bothersome menopausal symptoms, were asked to complete the PROM.
Reliability and validity
To secure adequate psychometric properties of the final PROM, we conducted Rasch analysis on the collected data to verify whether the items in each suggested domain fitted a partial credit Rasch model for polytomous items [20]. We tested differential item functioning (DIF) [21,22], i.e. whether items performed differently depending on the variables: occupation, education, living arrangement (living alone), smoking, BMI, age, having a hormonal intrauterine device, being bilaterally ovariectomized, being hysterectomized, and having menstruation within the past year. Local dependence (LD), i.e. whether items were correlated beyond what could be expected from measuring the same underlying construct, was also evaluated [15,23] using item screening and log-linear Rasch model tests [24,25]. Where evidence of DIF and/or LD was found, a log-linear Rasch model was considered, as this indicates a scaling solution with desirable measurement properties [14]. Andersen's conditional likelihood ratio test (CLR-χ²) was used to evaluate the overall model fit [26], and individual item fit was assessed by comparing the observed and expected rank correlation between the item and the rest-score (the sum of the other items in the scale) [27]. Items that demonstrated the most problematic properties and/or poor fit were deleted stepwise from the scales until fit to the Rasch model was achieved. Items with misfit but high face and content validity were kept as single items. Cronbach's alpha was used as a measure of reliability [28,29]. The Benjamini-Hochberg procedure was used to account for multiple testing [30].
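As an illustration of the partial credit Rasch model for polytomous items referred to above, the following minimal Python sketch computes the response-category probabilities for a single item. It is not the SAS or DIGRAM implementation used in the study; the function name `pcm_category_probabilities`, the person location `theta`, and the threshold values are assumptions chosen only for illustration.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Return P(X = x | theta) for x = 0..m under the partial credit model.

    theta      -- person location in logits (hypothetical value)
    thresholds -- list of m item threshold parameters in logits (hypothetical)
    """
    # Category x has the exponent sum_{k=1..x} (theta - delta_k);
    # category 0 has the empty sum, i.e. 0.
    exponents = [0.0]
    for delta in thresholds:
        exponents.append(exponents[-1] + (theta - delta))

    # Subtract the largest exponent before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical item with four ordered response options
# ("no, not at all", "yes, a bit", "yes, quite a bit", "yes, a lot").
probs = pcm_category_probabilities(theta=0.0, thresholds=[-1.0, 0.0, 1.0])
```

With a single threshold the model reduces to the dichotomous Rasch model, which is one way to sanity-check the function.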
The sum-scores of the resulting Rasch-fitting scales (see below) were tested by comparison to the global item. For each of the four categories of the global item, the means and standard deviations (SD) of the sum-scores were calculated and compared using ANOVA; also, the order of the means in a sum-score should reflect the order of the categories of the global item. We calculated the number of subjects needed in a hypothetical randomized trial to find, with 80% power, the difference between the means corresponding to the two last categories of the global item in a t-test with a significance level of 5%; low numbers indicate a high discriminating ability. We used SAS v9.4 and DIGRAM v3.05.3 software.
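The sample-size reasoning above can be sketched as follows. This is a minimal Python sketch using the standard normal approximation for a two-sided, two-sample comparison of means; it is an assumption that this matches the study's calculation, which was done in SAS or DIGRAM, and an exact t-test calculation typically gives a slightly larger number. The function name `n_per_group`, the use of a common SD for both groups, and the example means are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(mean_a, mean_b, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect mean_a vs mean_b.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2 with d = |mean_a - mean_b| / sd,
    i.e. the normal approximation to the two-sample t-test with a common SD.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    effect_size = abs(mean_a - mean_b) / sd
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical sum-score means for the two highest global-item categories.
needed = n_per_group(mean_a=3.0, mean_b=4.0, sd=2.0)
```

The larger the standardized difference between the two category means, the fewer subjects are needed, which is why a low number indicates high discriminating ability.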
The study was approved by the Danish Data Agency (J.nr. 2015-41-4057). Ethics Committee approval was not required.
These questionnaires had in total 356 items, of which 126 were unique items divided into five domains (Additional file 1: Appendix 1).

Phase 2
The first group interview included five women (aged 50-63 years), and the second included four women (aged 49-59 years).
In the two group interviews, 95 (75.4%) of the 126 items were endorsed, and 27 new items (five of these due to double-barrelled items) and three new domains were generated (Additional file 1: Appendix 1). In the first group interview it was revealed that hot flushes and day-and-night sweats were experienced as two different things (constructs). Some women were bothered by hot flushes but did not experience day-and-night sweats. Others were bothered by both hot flushes and day-and-night sweats, but described them as different experiences. This was confirmed in the second group interview.
The women agreed on a three-month recall period and preferred the four response options; "no, not at all", "yes, a bit", "yes, quite a bit", or "yes, a lot" (Table 1. Item layout). In the sexual domain it was decided to create an additional response option "I do not know" for respondents not sexually active, with or without a partner. These preferences were later confirmed in the single interviews. By the end of the second group interview no new items or domains were generated.
Women interviewed individually were aged 50-52 and the women who participated in pilot testing were aged 50-64. In these interviews, almost all comments were about linguistic issues or layout suggestions and only one extra item was desired and another item perceived as redundant and deleted. At this point we achieved data saturation. Finally, one woman requested a comment box at the end of the PROM. Table 2 presents the number and age of participants in the interviews. The final version of the draft PROM encompassed 122 items covering 8 domains (Additional file 2: Appendix 2) and took, on average, 10 min to complete.

Phase 3 Survey
Within 48 h, 1511 women had completed the draft PROM. Seven completed questionnaires were excluded: six respondents were under the age of 45 years and one respondent gave ambiguous and inconsistent responses. The characteristics of the remaining 1504 respondents are listed in Tables 2 and 3.

Psychometric analysis
The analyses revealed eleven uni-dimensional scales fitting a Rasch model. One single item was retained due to high face validity.
The final PROM was named the MenoScores Questionnaire (MSQ) and the eleven scales cover the constructs: hot flushes (HF), 2 items; day-and-night sweats  Table 4.

Vasomotor symptoms
This suggested six-item domain showed misfit. Based on evidence of LD and results from the qualitative interviews, where hot flushes and day-and-night sweats were described as two different constructs, three two-item scales were formed. These scales all fitted a Rasch model and had no evidence of LD and were named: hot flushes (HF), day-and-night sweats (DNS) and general sweating (GS).
In the HF scale, item 4 (hot flushes during the day) showed DIF with respect to (wrt.) having menstruation within the past year (p = 0.0013), and item 5 (hot flushes during the night) showed DIF wrt. BMI (p = 0.0008). In the DNS scale, item 6 (sweats during the day) showed DIF wrt. BMI (p < 0.0001), and item 7 (night-sweats) showed DIF wrt. having menstruation within the past year (p = 0.0045). In the GS scale there was no evidence of DIF.
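Because many item-by-covariate DIF tests are run per scale, the Methods state that the Benjamini-Hochberg procedure was used to account for multiple testing. The following is a minimal Python sketch of that step-up procedure; the function name `benjamini_hochberg`, the 5% false discovery rate, and the example p-values are assumptions for illustration and are not the study's data.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from eight DIF tests on one scale.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
flags = benjamini_hochberg(example_p)
```

In this hypothetical example only the two smallest p-values remain significant after adjustment, although five of the eight are below 0.05 unadjusted.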

Sleep
The suggested 10-item domain did not fit a Rasch model. A two-item menopausal-specific sleeping problems

Abdominal (ABD) scale
This 10-item scale was rejected, but a 4-item scale comprising items 77, 96, 98, and 102 was found to fit a log-linear Rasch model. In this scale, LD was found between item 77 (nausea) and item 98 (uncontrollable loss of gas). Item 96 (bloated stomach) showed DIF wrt. age and item 98 showed DIF wrt. education.

Urinary-vaginal (URIN) scale
The 6-item scale was rejected, but a 4-item scale comprising items 106, 107, 108, and 110 was obtained. Item 108 (urine smells different) showed DIF wrt. smoking, and LD was found between item 106 (need to pass urine more frequently) and item 107 (sometimes leak urine), and between item 108 (urine smells different) and item 110 (vaginal discharge has been different).
Item 91 (more tired than usual) did not fit any of the scales. The item was also tested with the MSSP scale but without a fit to a Rasch model. Finally, the item was tested with the three related items 92, 93, and 94 but they did not fit a Rasch model. Nevertheless item 91 was retained as a single item because of its high face validity.

Sexual
Four items (115, 116, 117, 118) from this domain fitted a Rasch model and were named the sexual (SEX) scale. LD was found between items 115 (pain during intercourse) and 116 (bleeding after intercourse). Item 115 showed DIF wrt. age and being bilaterally ovariectomized, item 116 showed DIF wrt. having a hormonal intrauterine device and having menstruation within the past year, and item 117 (too tired for sex) showed DIF wrt. living alone.
The SH and ABD scales showed signs of dichotomization in the category probability curves. The SH, ABD and SEX (with the additional response option "I do not know") scales were re-tested in three single interviews (with women aged 50 to 65), and all women preferred three response options instead of four ("no, not at all", "yes, a bit", or "yes, a lot", plus the additional option in the SEX scale). In order to optimize model fit, the response options in these scales were reduced to the three options above (including the additional option in the SEX scale).

Work and spare time
Two-thirds of respondents were asked to complete this domain (i.e. women who claimed to be bothered by menopausal symptoms by answering "yes" to the global item). The 3-item domain fitted a Rasch model (p = 0.117), but items 1 and 3 showed extremely poor item fit (p = 0.0001 and p < 0.0001, respectively). Thus, we decided to exclude this domain from the final PROM.

Menstruation
Only women who had menstruated within the past year were asked to complete this domain (approximately half of the respondents) (Table 3). This suggested 3-item domain did not fit a Rasch model (p < 0.001) and the items were not included in the final PROM.

Association (discrimination)
The HF, DNS, GS, and MSSP scales showed the best performance in discriminating between the response options of the global item (Fig. 1. HF, DNS, GS, MSSP scales). The discriminating ability is presented in Table 5.

Reliability
The reliability of the scales was moderate to high with Cronbach's alpha values between 0.60 and 0.91 (Table 5). Table 4 presents individual item fit and Table 5 presents fit statistics, Cronbach's alpha, and discriminating ability.
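For readers unfamiliar with the reliability coefficient reported here, Cronbach's alpha for a scale of k items is k/(k-1) times one minus the ratio of the summed item variances to the variance of the sum-score. The following minimal Python sketch shows that calculation; the function name `cronbach_alpha` and the response data are hypothetical and are not taken from the survey.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_variances = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of the sum-score across respondents.
    total_variance = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses to a 2-item scale scored 0-3 (one row per respondent).
responses = [[0, 0], [1, 1], [2, 3], [3, 3], [1, 2], [0, 1]]
alpha = cronbach_alpha(responses)
```

Two items that always receive identical scores give an alpha of 1, whereas uncorrelated items give an alpha near 0.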

Discussion
We found that all existing questionnaires lacked content validity regarding bothersome menopausal symptoms and none were validated using IRT. Moreover, they all regarded hot flushes and day-and-night sweats as a single construct, which this study could not confirm. We found that the suggested vasomotor domain was three-dimensional, concluding that hot flushes and day-and-night sweats are two different constructs. This was revealed in the qualitative interviews and confirmed by the Rasch analysis. Furthermore, these findings were confirmed when screening potential participants for a current randomized controlled trial (RCT) [48]. This study also revealed that only some symptoms are unquestionably related to the menopausal transition and that four constructs are of importance when measuring bothersome menopausal symptoms: hot flushes, day-and-night sweats, general sweating and menopausal-specific sleeping problems. A strength of this study is the combination of rigorous qualitative and quantitative processes. Through the qualitative interviews we secured high content validity. Subsequently we used Rasch models to assess if the suggested domains behaved psychometrically as we expected. Another strength is the assessment of discriminating ability. Using the responses to the global item, in relation to the responses to the remaining items, we assessed how well the individual scales within the MSQ discriminated between the response options of the global item. We found the HF, DNS, GS and MSSP scales performed best in discriminating. Our interpretation of this is that only these constructs (HF, DNS, GS, MSSP) are unquestionably related to the menopausal transition. Many other symptoms may be, more or less, caused by aging.
A limitation could be that as the data was collected cross-sectionally, test-retest analysis is not reported. Women with bothersome menopausal symptoms report fluctuations in their symptoms from day-to-day. Therefore, a test-retest with a 2-week interval would not be meaningful. Instead we assessed the internal consistency of the scales using Cronbach's alpha. A further limitation is the broad sampling procedure which makes it difficult to know exactly what population the sample is representative of, due to the element of self-selection inherent in survey data using web-based enrolment. The fact that Rasch validation is performed without distributional assumptions mitigates this challenge.
We identified DIF and LD in some of the final scales, which may limit the MSQ's applicability in some situations. Items 4 and 5 from the HF scale and items 6 and 7 from the DNS scale all possessed DIF. Nevertheless, these items were maintained because of their high face validity. If the developed scales are used in an RCT, DIF is far less problematic because any exogenous variables will presumably be equally distributed among the randomized groups. However, if the scales are used in non-randomized studies, and any exogenous variables that can cause DIF appear in the studied cohort, one should adjust for the magnitude of the identified DIF [22]. Another approach would be to refrain from using items possessing DIF, or the scales encompassing such items [22]. Scales with many items may be preferred, since many items in a scale could increase the sensitivity, specificity, reliability, and ability to discriminate between the groups being tested. In the present study, our interest was to assess if the women were "not at all", "a bit", "quite a bit", or "a lot" bothered by menopausal symptoms. We found the best discriminating scales among four 2-item scales, the vasomotor and sleeping scales (HF, DNS, GS and MSSP), and not among scales encompassing more items. There could be two reasons for the lower discrimination of the longer scales: 1) LD, although even after deleting items with LD these scales still did not discriminate as well as the scales from the vasomotor and sleeping domains; 2) that the subject matter of the other scales is related more to aging than to menopause.
Due to the large item-pool we identified, we could discard problematic and poorly fitting items using a stepwise procedure. However, we ensured that no important items were lost just because of a psychometric misfit. Therefore, items with high content validity but psychometric misfit were kept as single items, e.g. item 91 (more tired than usual).
Even though the "work and spare time" domain fitted a Rasch model, the items showed poor item fit. Since these items were not symptoms in themselves, but referred to how menopausal symptoms affected women's work and spare time, we decided to discard these items and omit this domain from the final PROM. Moreover, the 3-item menstruation domain did not fit a Rasch model, and as these items were not of high relevance to this study, they were excluded from the final PROM.
Since the timing of menopause and the experience of menopausal symptoms vary so widely [1,4], the MSQ is designed to measure self-reported bothersome menopausal symptoms in both peri- and post-menopausal women. The intention is for the MSQ to be used as an outcome measure in studies where women are treated for bothersome menopausal symptoms. The time needed to complete the MSQ is estimated at 5 min, as the MSQ contains fewer than half the items in the draft PROM.
The MSQ only addresses bothersome menopausal symptoms since these would be the target for treatment. It is important to note that some women also have positive experiences in relation to the menopause [49]; however, this is beyond the scope of the present study. The MSQ was developed in Danish and any new language or modified version may need an additional validation study to secure adequate measurement properties.