Rasch analysis of the Meaning in Life Questionnaire among adults from South Africa, Australia, and New Zealand

Background Meaning in life is a key indicator of subjective well-being and quality of life. Further developments in understanding and enhancing the construct will depend inter alia on the sound measurement thereof. This study is at the forefront of applying modern psychometric techniques to the Meaning in Life Questionnaire, a scale widely used to assess meaning in life. Method The Rasch rating scale model was applied to the Presence and Search subscales of the Meaning in Life Questionnaire using a sample of 601 adults from South Africa, Australia, and New Zealand. Results The Presence subscale was insensitive at high levels of presence of meaning while the majority of the respondents fell in that range. Removal of item 9 (“My life has no clear purpose”) and collapsing the response categories indicative of low and medium levels of the latent construct significantly improved the subscale’s targeting and fit to the Rasch model, resulting in a subscale that exhibited differential item functioning on items 1 (“I understand my life’s meaning”), 4 (“My life has a clear sense of purpose”), and 5 (“I have a good sense of what makes my life meaningful”) for country, but none for gender, age group, or education level. The Search subscale yielded disordered category threshold calibrations, but after collapsing some of the response categories representing low and medium levels of the target construct, a subscale that demonstrated good fit to the Rasch model, good targeting, and no differential item functioning resulted. Conclusions In terms of this particular scale, adaptation of the rating scale and removal of item 9 is recommended. Country-level parameter estimates may be needed for items that exhibited differential item functioning. The study also has significant implications for the theory, measurement, and practice of meaning in and quality of life in general. Reasons for and the far-reaching implications of the insensitivity of the Presence subscale for high levels of presence of meaning on, for example, the correlation between meaning in life and indicators of health are contemplated. Further investigation of the construct’s nature and measurement, especially at high levels, is indicated.


Background
Quality of life involves an evaluative judgement of an individual's physical, cognitive, emotional, and social functioning and can be based on subjective (self-report) and/ or objective (independent sources of information) indicators [1,2]. Although quality of life research traditionally focused on situations and factors that undermine or endanger quality of life, recent research has increasingly stressed the importance of incorporating positive constructs, such as subjective well-being, positive emotions, and character virtues and strengths in the conceptualisation and study of quality of life [2,3]. One of the key constructs that is widely considered an integral part of a life well-lived and quality of life is meaning in life [4][5][6][7]. A myriad of studies have explored the relationship between meaning in life and mental well-being, as well as psychopathology [8]. Also, the association between meaning in life and health-related quality of life has been established in multiple studies [9].
In order to study meaning in life and its quality of life concomitants, the construct has to be conceptualised theoretically. Different models are used in the literature to conceptualise this complex phenomenon, for example those of Wong [10], Schnell [11], and Steger [12]. Steger's model differentiates between presence of meaning, which involves "the extent to which people comprehend, make sense of, or see significance in their lives, accompanied by the degree to which they perceive themselves to have a purpose, mission, or overarching aim in life" [12], and search for meaning, which refers to "the strength, intensity, and activity of people's desire and efforts to establish and/or augment their understanding of the meaning, significance, and purpose of their lives" [13].
Theoretically and empirically sound measurement instruments that assess meaning in life are crucial for the rigorous study of the construct, to understand its associations with psychological well-being and psychopathology, and to assess the impact of interventions targeting meaning in life. Various models of meaning have been operationalized in self-report questionnaires (see [14] for a systematic review of these measures). One such scale that is widely used and recognized for its outstanding psychometric properties [14] is the Meaning in Life Questionnaire (MLQ) [15], which operationalizes Steger's [12] model of meaning in life. Steger et al. [15] showed that the scale, which consists of two subscales corresponding to the theory, namely Presence of Meaning (MLQ-P) and Search for Meaning (MLQ-S), demonstrated sufficient internal consistency and test-retest reliability, as well as structural, convergent, and discriminant validity in three American student samples.
Since the initial development study of the MLQ [15], which utilised only data from American student samples, good psychometric properties of the scale have been shown in a number of other contexts, cultures, and translations. For example, validity and reliability were shown for the English version of the scale among a web-based survey of adults [16], an American sample of people diagnosed with serious mental illnesses in an inpatient setting [17], and in a multi-cultural South African student setting [18]; for the Japanese translation of the scale among a Japanese student sample [19]; for the Spanish translation of the scale among a Spanish student sample [20]; and for the Turkish version of the scale among a combined college student and adult community sample [21].
Even though the MLQ is widely appraised to possess good psychometric properties [14] and the measure has been found to function well across age groups [13] and cross-culturally [13,21,22], the scale has, as far as we are aware of, never been evaluated from an item response theory (IRT) perspective. IRT provides a modern and reputedly superior alternative to classical test theory, as it discriminates more finely among different sources of error, especially regarding features of individual items that may influence their performance [23]. The family of IRT models share the assumption that the probability of a respondent endorsing any particular item is considered to be a function of the respondent's level on the underlying latent variable that is measured and the characteristics of the item [24].
The Rasch model, specifying only one parameter to characterize each item (item difficulty), is the simplest IRT model and was developed by the Danish mathematician, Georg Rasch [25,26]. Unlike in other IRT models and classical test theory techniques where the intent is to find a model that best fits the data, the Rasch model requires the data to fit the model in order to yield objective measurement [27]. The Rasch model postulates that useful measurement involves a unidimensional construct increasing or decreasing monotonically along an interval scale [28]. Rasch modelling provides a method to transform ordinal data (e.g. data from Likert-type items) into continuous, equal interval units (logits), which allows for the summation of the items' raw scores, where the summed raw score is a sufficient statistic [29,30]. Rasch analysis can be used in scale development, for example by reviewing the functioning of the response categories, the unidimensionality of the scale, and the targeting of the measure [31]. Moreover, Rasch modelling can be used to investigate differential item functioning (i.e., when different demographic groups responded differentially to an item despite equal levels of the latent construct), thus enhancing the assessment of item-level cross-cultural invariance of measurement scales [32].

The present study
In the present study, the Meaning in Life Questionnaire [15] was examined against the assumptions of the Rasch model. This is the first known study where the scale is analysed using an item response theory (in particular, Rasch modelling) approach. By applying the Rasch model, we explored the unidimensionality of each subscale, the functionality of the response categories, and how well the sample was targeted by the scale. We also examined differential item functioning (DIF) of the scale for a range of demographic variables.

Participants
The sample (N = 601) consisted of about equal sized groups of adults from South Africa, New Zealand, and Australia, who all completed the original English version of the MLQ as part of a battery of scales used in the international Eudaimonic and Hedonic Happiness Investigation (EHHI) project [33]. Participants were selected to be fluent in English, have at least secondary education, and be between 30 and 60 years of age. The aim was to factorially cross gender, age (three age groups of 30-39 years, 40-49 years, and 50-60 years), and education level (secondary and tertiary education). The sociodemographic profile of the sample is summarised in Table 1.

Socio-demographic questionnaire
Demographic information of each participant, including country of residence, gender, age group, and education level, was obtained.
Meaning in Life Questionnaire (MLQ) [15] The MLQ comprises two subscales that was developed to be relatively independent: Presence of Meaning (MLQ-P) and Search for Meaning (MLQ-S) [15].
Responses to 10 statements are provided on a rating scale with response options 1 = Absolutely Untrue, 2 = Mostly Untrue, 3 = Somewhat Untrue, 4 = Can't Say True or False, 5 = Somewhat True, 6 = Mostly True, and 7 = Absolutely True. In the original validation study among American students, the scale exhibited good internal consistency and test-retest reliability, as well as structural, convergent, and discriminant validity, with the Cronbach's alpha values of the Presence subscale varying between 0.82 and 0.86 and for the Search subscale between 0.86 and 0.87 [15]. Good internal consistency reliability was found in South African student [18], New Zealand adult [34], and web-based Australian samples [35], with alpha-values of .85, .90, and .88, respectively, for the MLQ-P, and .94, .91, and .92, respectively, for the MLQ-S.

Procedure and ethical considerations
A mixed-methods cross-sectional survey design was used, where participants responded to open-ended questions related to happiness, meaning in life, and goals, and completed a battery of quantitative measurement scales. For the current investigation, only responses to socio-demographic questions and the MLQ were used. In order to avoid the potential complications of missing values and imputation techniques in Rasch analyses, respondents who generated missing values on the MLQ were removed from the sample. This involved 15 participants from South Africa, whose removal was justified by the fact that the original South African sample was larger than the samples from Australia and New Zealand. The sample from New Zealand contained no missing responses, and for the Australian sample six respondents were removed. Ethical approval was obtained from the respective regulatory ethics committees in each country. Participants were recruited by research leaders within each country using poster and newspaper advertisements and the snowball-method. Participants were provided with information on the study prior to voluntary participation.

Data analysis
Data were analysed using the Rasch rating scale model [25], which assumes that the distances between the thresholds of polytomous items (i.e., the probabilistic midpoints between adjacent response categories) are equal across all items. The Winsteps® 3.81 software [36] was used for all analyses, except for the graphical presentation of the person-item threshold distributions (Fig. 2), which was obtained from RUMM2030™ [37]. The MLQ-P and MLQ-S were evaluated separately, since the scale was designed to yield two relatively independent subscales [15]. Since no single aspect of Rasch analysis is definitive in identifying the optimal data- model relationship, multiple tests and graphical representations should be used to examine the characteristics of the items and persons [30]. The following interrelated facets of Rasch analysis should be considered simultaneously to inform decisions.

Person and item separation and reliability
Person separation and reliability indices indicate how well one can discern persons along the measured variable [28] and values larger than 2 and 0.8, respectively, imply that the items are sensitive enough to differentiate two levels of persons according to their level of intensity on the construct (high and low scorers) [38]. Item separation and reliability indices are indicative of the capacity of the instrument to define a unique hierarchy of items along the measured construct [28] and values larger than 3 and 0.9, respectively, suggest that the sample is large enough to confirm the item challenge order (on three levels of item challenge) [38].

Unidimensionality and local independence
According to the Rasch model, useful measurement is obtained when a unidimensional construct is measured by locally independent items [30]. In terms of unidimensionality, item infit or outfit mean square statistics smaller than 0.6 can be indicative of overfit, and values larger than 1.4 of underfit when the rating scale model is used [28]. The point-biserial correlation of an item indicates whether higher scores on the item correspond with higher levels of the underlying construct and positive values are expected [38]. In addition, lack of unidimensionality may exist when the eigenvalue of the first contrast in a Rasch principal components analysis of the residuals (PCA-R) (i.e., the first component after the Rasch component has been removed) is larger than 2.0, and when the variance explained by the Rasch component is small (e.g., < 40 %) [38]. Correlations between the residuals of item pairs of around 0.7 are indicative of high local dependence, while correlations around 0.4 are considered to be low [38].

Response category functioning
Rasch analysis enables the researcher to investigate how the respondents used the rating scale so that scale developers can decide on the optimal number and combination of rating scale categories [31,39]. This task can be accomplished by examining how the data fit the Rasch model after response categories were collapsed. Bond and Fox [28] provided guidelines in this regard, including that the collapse should make intuitive sense and that the ideal is to create a uniform frequency distribution over the categories with each category containing at least 10 observations. Also, the average measures of the categories and the category threshold estimates should increase monotonically, with the category threshold estimates having steep gradients (at least 1.4 logits, but no more than 5.0 logits) to ensure that each category represents a distinct portion of the latent variablethis can also be investigated graphically by looking at the category probability curves. Lastly, the infit and outfit mean square statistics of each response category should be less than 2.0.

Targeting
Rasch analysis can be used to detect gaps in the continuum of the measured construct by identifying poor targeted items or persons, such as items for which there is an insufficient number of persons with an intensity level comparable to the item challenge 1 , or persons for which there is an insufficient number of items with a challenge level comparable to the person's intensity [40]. This goal can be attained by examining the person-item threshold distributions generated by RUMM2030™, which offers a visual comparison of the distribution of the person intensity levels (top part of the graph) and the item challenge levels (bottom part of the graph) along the latent trait continuum, with the information provided by the items also mapped onto the person distribution.

Differential item functioning
Rasch analysis can assist in identifying differential item functioning (DIF), which occurs when different groups of people within the sample responded in a different way to an item despite equal levels of the construct that was measured. In this study, uniform DIF [31] was investigated for country, gender, age group, and education level. The degree of DIF was assessed by comparing p-values from the polytomous version of the Mantel-Haenszel statistic [41,42] against a Bonferroni-corrected 5 % significance level, as well as the DIF Contrast, which is indicative of moderate to large DIF when it is larger than or equal to 0.64 [38].

Results for the presence subscale MLQ-P
Although the MLQ-P yielded person and item separation and reliability indices that were in line with the guidelines and the results from the PCA-R suggested sufficient unidimensionality and local independence of the items (see Table 2), item 9 ("My life has no clear purpose") showed misfit based on its infit and outfit mean square statistics (see Table 3). Also, response category 1 (Absolutely untrue) exhibited a low frequency and misfit based on its outfit mean square statistic (see Table 4). Although the average measures and threshold calibrations increased monotonically as the categories increased, the threshold calibrations were close to each other, indicating that categories 2 (Mostly untrue), 3 (Somewhat untrue), and 4 (Can't say true or false) were the most likely to be endorsed on only a small portion of the latent construct (see Table 4 and Fig. 1). From the person-item threshold distribution ( Fig. 2) it was clear that the person intensity was in general higher than the item challenge, indicating that the scale exhibited poor targeting for persons with high levels of the latent construct. The MLQ-P showed DIF for country on items 1 ("I understand my life's meaning"), 4 ("My life has a clear sense of purpose"), and 9 ("My life has no clear purpose"), as depicted in Table 6. There was no significant DIF for gender, age group, or education level.
In an attempt to remedy the problems highlighted for MLQ-P, all possible combinations of response category collapses were explored, but none of the collapses resolved the problems with item 9. Therefore the next step was to remove item 9, resulting in a 4-item scale (hereafter labelled MLQ-P-4).

Results for the MLQ-P-4
The person and item separation and reliability indices improved significantly after item 9 was dropped from the scale (see Table 2). The PCA-R yielded results that    The original item 9 was reversed in these analyses confirmed satisfactory unidimensionality and local independence (Table 2) and all point-biserial correlations (values ranged between .79 and .85) and item infit and outfit mean square statistics (Table 3) pointed towards good fit. Although none of the response categories showed misfit based on their infit and outfit mean square statistics, the category probability curve (not shown) and threshold calibrations (see Table 4) still revealed that response categories 2 (Mostly untrue), 3 (Somewhat untrue), and 4 (Can't say true or false) were the most likely to be endorsed over only a small portion of the latent variable, suggesting redundant response categories. Category 1 (Absolutely untrue) also still generated a low frequency. The person-item threshold distribution (not displayed) suggested even worse targeting for persons with high levels of the latent construct when compared to the full MLQ-P. The MLQ-P-4 showed DIF for country on items 1 ("I understand my life's meaning"), 4 ("My life has a clear sense of purpose"), and 5 ("I have a good sense of what makes my life meaningful") as depicted in Table 6. No significant DIF was found for gender, age group, or education level.
In order to address the redundancy of the response categories, the next step was to explore all possible combinations of category collapses.

Results for the MLQ-P-4, response categories collapsed
Based on Rasch model diagnostics, two combinations of category collapses produced superior performance: One where category 1 (Absolutely untrue) was collapsed with category 2 (Mostly untrue), and category 3 (Somewhat untrue) with category 4 (Can't say true or false)hereafter labelled MLQ-P-4 1122345; and one where categories 2, 3, and 4 were collapsedhereafter labelled MLQ-P-4 1222345. For both, the separation and reliability indices and the results from the PCA-R were in line with the results before collapsing categories (see Table 2). Due to space limitations only the results of the MLQ-P-4 1122345 are displayed in Tables 3, 4, and 5, and Figs. 1 and 2. Results for the MLQ-P-4 1222345 were similar, unless indicated in the text. The item infit and outfit mean square statistics (Table 3)    well and the response categories showed good fit, with threshold calibrations increasing monotonically and being sufficiently distanced from each other (see Table 4 and Fig. 1). For the MLQ-P-4 1222345, the frequency of category 1 (Absolutely untrue) was low, while the MLQ-P-4 1122345 yielded a larger frequency for category 1. Collapsing the categories improved the targeting of the scale considerably (see Fig. 2). Both the MLQ-P-4 1122345 and the MLQ-P-4 1222345 showed DIF for country on items 1 ("I understand my life's meaning") and 5 ("I have a good sense of what makes my life meaningful") as shown in Table 6. No significant DIF was found for gender, age group, or education level.

Results for the search subscale MLQ-S
The separation and reliability indices for the MLQ-S were in line with the guidelines, and the results from the PCA-R pointed to sufficient unidimensionality and local independence (see Table 2). Considering the item infit and outfit mean square statistics (Table 3) and the point-biserial correlations (values ranged between .80 and .85), all items fitted the Rasch model well. Although the infit and outfit mean square statistics of the response categories adhered to the guidelines, the threshold calibrations of categories 2 (Mostly untrue), 3 (Somewhat untrue), and 4 (Can't say true or false) were disordered, pointing towards problematic use of the rating scale (see Table 5), which is also evident in the category probability curve (Fig. 1). The person-item threshold distribution (Fig. 2) portrayed that the average item challenge was slightly lower than the average person intensity, but from the information curve it is clear that there was substantial information available for the majority of respondents. There was no significant DIF for country, gender, age group, or education level. In an attempt to remedy the disordered threshold calibrations, all possible combinations of response category collapses were explored.

Results for the MLQ-S, response categories collapsed
Based on Rasch model diagnostics, two combinations of category collapses stood out as superior: One where Although the item separation dropped slightly after collapsing the categories, the person separation increased and the person and item reliability indices remained unchanged (see Table 2). Results of the PCA-R suggested sufficient unidimensionality and local independence (see Table 2). Based on the item infit and outfit mean square statistics (Table 3) and the point-biserial correlations (values ranged between .82 and .89 for MLQ-S 1223345 and between .82 and .88 for MLQ-S 1233456), all items manifested adequate fit. The problem of disordered category thresholds has been resolved, the distances between the threshold calibrations have improved, and the infit and outfit mean square statistics of the response categories pointed towards satisfactory fit (see Table 5 and Fig. 1). The person-item threshold distribution (Fig. 2) suggested improved targeting for MLQ-S 1233456, but for MLQ-S 1223345 (not shown) the average item challenge level was found to be more than the average person intensity level, which suggests less optimal targeting. There was no significant DIF for country, gender, age group, or education level. The MLQ-S yielded disordered category threshold calibrations, but after collapsing some of the response categories representing low and medium levels of the target construct, a scale that demonstrated good fit to the Rasch model, good targeting, and no DIF resulted. Several specific aspects of the results will now be discussed.

Reversed item
The first significant finding that warrants discussion is the poor performance of item 9 ("My life has no clear purpose"), the only reversed item in the MLQ-P scale. In a review on misresponse to reversed and negated items, Weijters and Baumgartner [43] advocated for the inclusion of reversed items in measurement scales as it can provide many benefits (e.g., control acquiescence, disrupt careless responding, and promote a broader coverage of the content domain), but stressed that it should be done with caution. A reversed item that is merely the negation of an item in the main direction (in point of fact, item 9 is basically the negation of item 4, "My life has a clear sense of purpose"), does not hold the benefit of broadening the content domain tapped by the instrument, and has the disadvantages inherent in negated items (e.g., accurately assessing level of agreement with statements that contain negation requires considerable cognitive strain) and reversed items (e.g., cross-cultural differences in response styles such as acquiescence). We therefore follow the guidance provided by Weijters and Baumgartner [43], who advised against the use of negated reversals, and consequently we recommend the removal of item 9, which will result in a 4-item Presence of Meaning subscale. Steger et al. [15] stated that the reversed item was retained in the hope of discouraging automatic response sets. It is our view that this concern is to a large extent already handled by the mixed administration of the Presence and Search subscales. If item 9 is removed, however, the remaining items 4 to 6 will tap presence of meaning and the last three items will tap search for meaning. To guard against careless responding and response sets, we recommend shuffling the last six items (item 9 excluded) so that the respondent does not respond to three items from the same subscale in sequence.

Number of response categories
For both subscales, the response categories indicative of low and medium levels of the latent construct appeared to be redundant and for the search subscale, the category thresholds were disordered. These findings suggest that the respondents were unable to distinguish reliably among the categories, and consequently fewer categories should yield more consistent, reliable scores. Weijters, Cabooter, and Schillewaert [44] suggested that seven response categories may be acceptable for populations who are expected to have high cognitive abilities, verbal skills, or questionnaire experience, such as college    i.e., the difference between the two countries' DIF measures) was larger than or equal to 0.64, the countries are specified in this column, MH if the p-value of the Mantel-Haenszel test was smaller than Bonferroni α, the countries are specified in this column, AU Australian sample, SA South African sample, NZ sample from New Zealand. In columns DIF Contrast and MH, x > y implies that respondents from country x found it significantly harder to endorse the item than respondents from country y given equal levels of presence of meaning a The original item 9 was reversed in these analyses students, but that a 5-point scale may be more appropriate for the general population. For future use, we recommend either a 6-point rating scale where the midpoint category 4 = Can't say true or false is dropped, or a 5point scale with categories 1 = Absolutely untrue, 2 = Untrue, 3 = Unsure, 4 = True, 5 = Absolutely true (the issue of whether to include a midpoint category is much debated in the literature [44,45]).

Targeting
In the present study, the average level of meaning in life captured by the items was substantially lower than the average level of presence of meaning manifested by persons who completed the scale, suggesting poor targeting. In fact, the scale provided little information for respondents with high levels of presence of meaning while at the same time most of the respondents fell within that range. This could have significant practical implications. Correlations in correlational studies will be largely influenced by the minority of people exhibiting lower levels of presence of meaning as reflected by lower scores on the MLQ-P, while nuances of presence of meaning at the higher end of the continuum will not be captured well. This can, for example, influence outcomes of studies where the associations between meaning in life and indicators of health and quality of life are studied significantly. In addition, in experimental studies or studies where intervention programs are evaluated, the MLQ-P would probably not detect changes in meaning in life of people on the higher end of the continuum, which involves the majority of people, as the scale is not sensitive to changes at the higher end of the continuum. Different explanations can be given for the findings regarding the targeting of the MLQ-P. One apparently obvious explanation is that there are not enough items or response options to capture high levels of the presence of meaning continuum and such items or response options should be added. However, given that the questionnaire already allows respondents to rate statements like "I understand my life's meaning" to be "absolutely true", it is not clear what kind of items or response options can be added to capture even higher levels of presence of meaning in life.
Another possible explanation pertains to the nature of presence of meaning as a construct and its distribution in the general population. The fact that the majority of the respondents endorsed high levels of presence of meaning according to their scores on the MLQ-P could simply tell us that most people indeed experience their lives as basically meaningful: Most respondents' level of presence of meaning were higher than the levels where the scale had optimal information, merely because there is not much variability at the upper end of the underlying construct continuum. Such an explanation speaks to the findings of Heintzelman and King [46], who conducted a review of research on meaning in life from epidemiological data and studies using the MLQ-P [15] and the Purpose in Life Test [47]. They found that diverse samples rated themselves significantly above the midpoint on self-report measures of meaning in life and concluded that most people experience their lives as "pretty meaningful". This line of thought can be linked to psychopathology literature where "quasi-traits" are distinguished. Reise and Waller [48] defines a quasi-trait as "a unipolar construct in which one end of the scale represents severity and the other pole represents its absence (depression versus not depressed)" which "is in contrast to a bipolar construct, where both ends of the scale represent meaningful variation (depression versus happiness)". In psychopathology research, the existence of quasi-traits with their associated peaked information curves (with the peaks in the range representing severe levels of the trait) has been found in many item response theory applications and often led researchers to conclude that items needed to be added or adapted to provide information at low (less severe) levels of the trait continuum [48]. According to Reise and Waller [48] this reasoning is problematic when working with quasi-traits: If the underlying latent construct is a quasi-trait, such attempts may be futileit will be difficult (if not impossible) to formulate items that yield information across the continuum of the trait. Similarly we can ask whether it would be possible to develop items designed to capture even higher levels of presence of meaning, or whether we should conclude that the variation of presence of meaning is limited at the higher end of the continuum, although the majority of people attain such high levels.
If we settle with the conclusion that the majority of the population attained maximum levels of presence of meaning, we will inevitably have to re-evaluate the usefulness of, for example, interventions that aim to enhance meaning in life in the general population (most of whom have attained high levels of meaning in life). The question would be what the (large) portion of people with high levels of meaning would gain from interventions that intend to enhance meaning. Accepting that the majority of the population have already attained levels of presence of meaning that do not allow for much improvement may pose further questions. For example, could it be possible that icons of eudaimonic living, such as Mahatma Ghandi, Mother Theresa, or Nelson Mandela, who sacrificed their lives for a greater cause, have experienced levels of meaning in life similar to the majority of people? Or should we rather conclude that the nuances of presence of meaning at higher levels are just not captured by the current conceptualisation and operationalization of the construct?
Another way to explain the poor targeting of the MLQ-P may be that the subscale applies a rather narrow understanding of meaning in life, with all items paraphrasing the notion of having found a sense of meaning or purpose in life. By repeating the same content using slightly different syntax, the scale actually operates in a similar way to a one or two-item measure, which could contribute to the inability of the scale to differentiate well at the higher end of the continuum. Alternative measures that capture a broader sense of meaning in life, such as the Sources of Meaning and Meaning in Life Questionnaire (SoMe) that operationalises meaningfulness through coherence, significance, direction, and belonging [11], may display better sensitivity.
In addition, one can argue that participants' presence of meaning in life was not really as high as they indicated it to besocial desirability may have augmented their scores artificially. However, presence of meaning in life has been found to be unrelated to scores on measures of social desirability in several studies [15,49] and, as argued by Heintzelman and King [46], high presence of meaning scores have been found consistently among diverse samples, including anonymous samples where social desirability may not have been a big concern. The high scores could have also been due to a generalisation effectwhen asked to respond to items that concern global meaning in life, people may not be sure what meaning actually refers to. They may have a broad understanding of meaning and therefore think that they generally experience meaning. However, if the constituents of meaning are spelled out, they might realise that they don't have as much meaning as they initially thought.
One may also reason that the lack of sensitivity to varying nuances of meaning in life at the higher end of the continuum relates to the fact that the scale relies on self-report and alternative avenues to capture meaning in life should be explored. This approach may be problematic because meaning in life is, at its heart, a subjective experience. Several studies have argued that self-report is the best way to capture meaning in life [46,50,51]. However, obtaining self-report using less structured approaches may add value, for example by using experience sampling methods [52] or qualitative methods.

Differential Item Functioning (DIF)
The data in this study were gathered in three different countries and two gender groups, three age groups, and two levels of education were distinguished. Of all these demographic variables, significant DIF was only detected for items from the Presence subscale based on the country variable. The absence of DIF is the desirable outcome should data from the different demographic groups be combined or compared [53].
The significant country DIF for items from the Presence of Meaning subscale warrants further attention. Before removal of item 9 ("My life has no clear purpose"), the item exhibited DIF for country: Given equal levels of the latent trait, respondents from South Africa tended to respond more strongly towards the extreme True response categories than respondents from New Zealand and Australia, and, similarly participants from New Zealand were more inclined to extreme responses in the True direction than participants from Australia. After removal of item 9 and before collapsing the response categories, item 1 ("I understand my life's meaning") manifested DIF, where Australians found it harder to endorse the item than South Africans given equal levels of the latent construct. After collapsing categories, this finding was extendedrespondents from both New Zealand and Australia found it significantly harder to endorse item 1 than respondents from South Africa given equal levels of the construct. Also, before collapsing categories, participants from New Zealand found it harder to endorse item 4 ("My life has a clear sense of purpose") than participants from South Africa given equal levels of the latent trait. Last, given equal levels of the latent construct, participants from South Africa found it harder to endorse item 5 ("I have a good sense of what makes my life meaningful") than respondents from Australia and New Zealand, both before and after collapsing categories. Country-specific parameter estimates may be needed for these items of the Presence subscale, that is, the dataset can be split by country and these items should be calibrated separately for each country [54].
The two items that respondents from Australia and New Zealand found harder to endorse than South Africans given equal levels of the latent construct (i.e., items 1 and 4) refer to comprehending one's life meaning and having a clear sense of purposeboth can be seen as a global state of grasping one's life meaning, without referring to the elements that brings meaning to one's life. South Africa is a developing country and together with the many challenges the country faces come multiple opportunities for individuals to contribute and to have a sense of purpose. This may especially be the case for educated individuals who may feel that they have skills and knowledge that can really make a difference in a country with many challenges (based on the selection criteria of this study all participants had at least secondary education). Australia and New Zealand, on the other hand, are first world countries with a lot more stability and certainty. People from such countries may feel that things "go right" regardless of their contribution which may possibly lead to having a less clear sense of purpose and meaning comprehension. Another possible explanation may be connected with the fact that the specific South African group in this study exhibited a higher frequency of religious practice (mostly Christianity) than the participants from Australia and New Zealand. Religiosity may be associated with a clear sense of purpose and meaning comprehension.
The item that South Africans found harder to endorse than respondents from Australia and New Zealand given equal levels of the latent trait (item 5) refers to an awareness of the constituents of a meaningful lifethe elements that make one's life meaningful. One possibility is to infer that people (in this case, South Africans) who find it easier to agree with items referring to a global comprehension of one's life's meaning (items 1 and 4), may not have such a pressing need to know what the elements are that make their lives feel meaningfulone may argue that they take it for granted or that they spend less time attending to the specific details of why they find their lives meaningful. In contrast, people who find it more challenging to agree with items related to comprehending one's life meaning and having a clear sense of purpose (in this case respondents from Australia and New Zealand), may be more attentive to the things that add life meaning.
For both items 4 and 9, South Africans tended to answer more strongly in the True direction when compared to respondents from Australia and New Zealand given equal levels of the latent construct. In other words, South Africans were more inclined to find both the nonreversed, non-negated statement "My life has a clear sense of purpose" (item 4) and the reversed, negated statement "My life has no clear purpose" (item 9) true. This points to a discrepancy which poses questions about the possible influence of response styles involved in responding to the reversed item that could have caused DIF. This finding provides additional support for the deletion of item 9.
Since all aspects of Rasch analysis are interconnected [30], the existence of cross-country DIF on the Presence subscale could have influenced the rest of the findings. Future research should explore whether the findings of this study replicate in more culturally homogeneous samples where DIF is not present.

Limitations and future directions
While the study makes important contributions to the body of knowledge about meaning in life and the measurement thereof across three countries, it also possessed several limitations. This study made use of the Rasch model, which is considered to be a one-parameter IRT model that includes only item difficulty as a parameter. Although the Rasch model has very attractive mathematical properties, analysing MLQ data using more complex IRT models will also be of value.
In this study, recommendations regarding the removal of item 9 ("My life has no clear purpose") and category collapses were made a posteriori based on removing the item from and collapsing categories of data attained using the original full scale. These recommendations should be tested in new datasets gathered with a revised scale.
The fact that the sample in this study comes from three different countries can be seen as a strength in the sense that diversity is reflected in the study of an already well-established scale. In addition, it allowed us to investigate DIF across the three countries. The fact that evidence was found for DIF across the countries, however, points towards the possibility that the scale may function differentially across the different country groups which could have had an influence on the rest of the results. This suggests the need for repetition of the study in more culturally homogeneous groups to investigate whether the findings replicate when cross-country influences are not present.
Another important avenue for future research is the revisiting of presence of meaning in life as a construct, in particular with regards to the higher end of the construct continuum. The content domain of presence of meaning should be explored qualitatively in order to deepen our understanding of the construct, especially at high levels. For example, by investigating lay people's conceptualisations of meaning in life, we may identify sub-facets of meaning in life which may provide greater variance at the upper end of the continuum.

Conclusions
The rigorous measurement of meaning in life is essential for the study of this key aspect of well-being and quality of life. The present study was the first to apply item response theory, in particular Rasch modelling, to investigate the psychometric properties and item-level equivalence of the MLQ across different demographic variables. The study offered valuable insights into the functioning of the MLQ in groups from South Africa, Australia, and New Zealand and the construct of meaning in life and the measurement thereof in general. In particular, the MLQ displayed good psychometric potential from a Rasch modelling perspective. However, several directions for revision were highlighted. First, the study pointed out that seven response categories may be too many when measuring meaning in life in the general population, and suggested that five or six response categories may be more appropriate. In addition, the study confirmed the potential problems involved in reversed, negated items, and suggested that this type of item should rather be avoidedremoving the reversed