
Measurement invariance and adapted preferences: evidence for the ICECAP-A and WeRFree instruments



Self-report instruments are used to evaluate the effects of health interventions. However, individuals adapt to adversity, which could lead them to report higher levels of well-being than one would expect given their circumstances. It is possible to test for the influence of adapted preferences on instrument responses using measurement invariance testing. This study conducts such a test with the Wellbeing Related option-Freedom (WeRFree) and ICECAP-A instruments.


A multi-group confirmatory factor analysis was conducted to iteratively test four increasingly stringent types of measurement invariance: (1) configural invariance, (2) metric invariance, (3) scalar invariance, and (4) residual invariance. Data from the Multi Instrument Comparison study were divided into subsamples that reflect groups of participants that differ by age, gender, education, or health condition. Measurement invariance was assessed with (changes in) the Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR) fit indices.


For the WeRFree instrument, full measurement invariance could be established in the gender and education subsamples. Scalar invariance, but not residual invariance, was established in the health condition and age group subsamples. For the ICECAP-A, full measurement invariance could be established in the gender, education, and age group subsamples. Scalar invariance could be established in the health condition subsample.


This study tests the measurement invariance properties of the WeRFree and ICECAP-A instruments. The results indicate that these instruments were scalar invariant in all subsamples, which means that group means can be compared across different subpopulations. We suggest that measurement invariance of capability instruments should routinely be tested with a reference group that does not experience a disadvantage to study whether responses could be affected by adapted preferences.


Policymakers need reliable information for decision-making. In health policy, this information is partially based on patient-reported outcomes. These outcomes reflect the patients’ experiences of their health condition, which might include an evaluation of how well-off they perceive themselves to be [1]. In this context, adapted preferences could influence responses to instruments [2,3,4]. Adapted preferences have been defined as follows: “preferences formation or adaptation is the phenomenon whereby the subjective assessment of one’s well-being is out of line with the objective situation” [5, p. 137]. When responding to instruments, patients with adapted preferences may report a higher level of well-being than one would expect based on their health condition [6, 7]. This is one form of response shift [8].

Differences in the interpretation of items have already been studied for instruments that are used in the wider health economic context [3, 9]. The authors of these studies indicate that such differences in the interpretation of items can affect decision-making when these instruments are used to establish the effect of health interventions [3, 9]. More specifically, the adaptation of preferences by patients might lead to an underestimation of the effect of new health technologies on well-being [10]. To illustrate, if a new health technology improves mobility, it might be difficult to measure its real effect when individuals who adapted to limited mobility report having a high initial level of mobility before the use of such a health technology [10]. This could lead to an unjust allocation of resources if the information that policymakers receive indicates that a new health technology only has a minor effect [10,11,12].

Adapted preferences might thus affect how individuals interpret and respond to instruments. It is therefore important to test whether different groups interpret and respond to items similarly to ensure that adapted preferences do not affect responses. One way of doing so is by testing for measurement invariance. Measurement invariance has been defined by Millsap [13, p. 462] as follows: “Some properties of a measure should be independent of the characteristics of the person being measured, apart from those characteristics that are the intended focus of the measure”.

Measurement invariance tests have been conducted to study whether instrument responses can be compared across cultures [14, 15], in education to study whether the measured ability of a student can be compared across groups (e.g. [16]), and in psychology to, for example, study whether results from personality research can be compared and generalized to various populations [17]. In each of these fields, measurement invariance testing has been used to study whether responses to items are equivalent. This is not only important for research, but could also affect individuals’ lives directly. To illustrate, a mathematics test that is not measurement invariant might penalize certain groups for having a different socioeconomic background, which has little to do with the mathematical ability of a student. In the context of quality-of-life instruments, too, measurement invariance testing has been one of the methods used to establish whether the interpretation of items and their responses changes over time in patient groups [18]. One explanation for this change is that patients adapt to their disease [19]. As such, a measurement invariance test can be a useful tool to study whether patients’ responses are affected by the adaptation of their preferences.

These tests have, however, not been routinely applied to capability-approach-inspired instruments in health economics. The capability approach is a theory developed by Sen [20]. Proponents of the capability approach argue that well-being should not only be assessed in terms of what people are or do (also called functionings) but also in terms of their freedom to be or do (capabilities). Based on this theory, several instruments have been developed to assess the impact of health interventions on well-being [21, 22].

Recent reviews of the psychometric properties of these capability instruments did not identify measurement invariance tests [23,24,25]. Besides these reviews, only one recent publication studied the measurement invariance properties of a capability instrument [26]. Amongst other things, this study tested the measurement invariance properties of the ICECAP-A in different subgroups in a sample of dermatological patients [26]. Measurement invariance could not be established in subgroups where participants were grouped according to age, marital status, or scores on a dermatology-specific quality-of-life index.

We also identified one further qualitative study that aimed to assess whether responses to the ICECAP-A, ICECAP-SCM, and EQ-5D-5L were influenced by adapted preferences utilizing think-aloud interviews [27]. The authors of this study concluded that there was little indication of adapted preferences in an end-of-life setting [27]. Although this study provides an important insight into this particular group’s reasoning when responding to items, it is unclear if these responses are comparable across groups from a psychometric perspective.

Previous studies in quality of life research have shown that age [4, 28,29,30], education [31], gender [29], and health condition [30, 32] could affect the interpretation of items. One explanation for these differences is that individuals adapt to adversity [30].

Hence, the primary aim of this study is to establish whether capability instruments can be shown to be measurement invariant across groups of individuals that differ in terms of age, education, gender, or health conditions.



The Wellbeing Related option-Freedom (WeRFree) instrument is a newly developed instrument that shows the benefits of developing surveys with a comprehensive conceptualization of the concept of “capability” [33]. The WeRFree instrument consists of 3 scales with a total of 15 items that measure health-related capabilities and subjective well-being [33]. These three scales represent different elements of capability and subjective well-being. Capability well-being is captured with the “perceived access to options” scale, which consists of five items measuring various aspects of health-related capabilities. Different elements of how people experience living with those capabilities are captured with the reflective wellbeing (six items) and affective wellbeing (four items) scales. All items follow a Likert scale format, with response options ranging from four to eleven categories. Depending on the construct, items inquire about the extent to which individuals feel satisfied with various aspects of their lives (from completely dissatisfied to completely satisfied), whether they agree with certain statements (from strongly disagree to strongly agree), whether they experienced certain emotions over the last four weeks (e.g. from all of the time to none of the time), and whether individuals can complete certain tasks (e.g. from being able to do tasks very quickly and efficiently without any help to not being able to do these tasks at all). The WeRFree instrument was developed by matching items from the Multi-Instrument-Comparison (MIC) study database with constructs from a theoretical framework developed earlier by the authors [33, 34]. Further information about the (theoretical) background of the instrument can be found in [22, 33, 34].

The ICEpop CAPability measure for Adults (ICECAP-A) is an instrument that was developed to assess the capability well-being of adults [35, 36]. The ICECAP-A measures capabilities in five domains: stability, attachment, autonomy, achievement, and enjoyment. Each of these domains consists of a single item, with each item having four response options. Each item inquires about the level of capability, ranging from no capability (I cannot…, I am unable…) to full capability (I can…, I am able to…). Together, these items reflect the capability well-being of individuals. The domains and items were developed through interviews with the general population of England [35]. Evidence indicates that the instrument shows construct validity, content validity and responsiveness in a number of different populations [25].


For this study, the MIC study database was used [37]. The objective of the MIC study was to analyze and compare a set of health-related quality of life (HRQoL) and well-being instruments. The general questionnaire of this study consisted of eleven such instruments. Following a cross-sectional design, the study was conducted in six countries: Australia, Canada, Germany, Norway, the United Kingdom, and the USA. A total of 9665 respondents completed the general questionnaire. Informed consent was obtained from all individual participants included in the study. Individuals were recruited with nine different health conditions: arthritis, asthma, cancer, depression, diabetes, hearing problems, heart problems, stroke, and chronic obstructive pulmonary disease. Additionally, a group of healthy individuals was recruited. Unreliable responses were removed from the database by the MIC study team. Responses were deemed unreliable if they showed inconsistencies between similar items or if respondents took too little time to complete the general questionnaire. After the removal of these responses, the MIC study database consisted of 8022 observations. Further information about the MIC study can be found on the website of the project [38]. For the analysis of the ICECAP-A, all responses in the MIC database were used except those from Norway, where the ICECAP-A was not administered. For measurement invariance testing, different subsamples were created based on the characteristics of the participants: participants were grouped according to their age, level of education, gender, and health condition. Measurement invariance was then tested in each of these subsamples for the WeRFree and ICECAP-A instruments.


Before conducting a measurement invariance study, the dimensionality of the instruments needs to be established. This was done through a confirmatory factor analysis (CFA). Model fit was considered acceptable when the fit indices met the following criteria: Comparative Fit Index (CFI) higher than 0.900, Tucker-Lewis Index (TLI) higher than 0.900, Root Mean Square Error of Approximation (RMSEA) lower than 0.08, and Standardized Root Mean Square Residual (SRMR) lower than 0.08 [39,40,41]. The model fit of the WeRFree instrument with the MIC data has been presented in an earlier study that further explains how the instrument was developed [33]. For the ICECAP-A, we followed the approach of Rencz, Mitev [26] and conducted a CFA in which we assumed that the five items reflect one construct: capability wellbeing. Additionally, Cronbach’s alpha was computed, with values above 0.7 deemed acceptable.
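The acceptance rule above can be expressed compactly. The study itself was conducted in R with lavaan; the following Python helper is purely illustrative (the function name and layout are our own) and simply encodes the four cut-offs:

```python
# Illustrative sketch (not the study's code): accept a CFA model only when
# CFI > 0.900, TLI > 0.900, RMSEA < 0.08, and SRMR < 0.08.

def acceptable_fit(cfi: float, tli: float, rmsea: float, srmr: float) -> bool:
    """Return True when all four indices meet the cut-offs used in this study."""
    return cfi > 0.900 and tli > 0.900 and rmsea < 0.08 and srmr < 0.08

# Fit values of the final ICECAP-A model reported in the results section:
print(acceptable_fit(cfi=0.995, tli=0.982, rmsea=0.062, srmr=0.013))  # True

# Fit values of the initial one-factor ICECAP-A model (RMSEA too high):
print(acceptable_fit(cfi=0.961, tli=0.922, rmsea=0.129, srmr=0.033))  # False
```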

A multi-group CFA was conducted to test for four different types of measurement invariance: (1) configural invariance, (2) metric (or weak factorial) invariance, (3) scalar (or strong factorial) invariance, and (4) residual (or strict) invariance [41,42,43,44]. These types were tested sequentially, since each successive type of measurement invariance is tested with a model that is more constrained than the previous one.

An instrument is (1) configural invariant if its factorial structure can be reproduced in different groups. In the current study, this would for example mean that the three-factor structure of the WeRFree instrument can be replicated in different groups. When configural invariance can be established, (2) metric invariance can be tested [41, 42]. An instrument is metric invariant when the factor loadings are invariant across different groups. A factor loading represents the strength of the relationship between a construct and an item, or, in other words, how strongly a change in the construct influences an individual’s response to the item. Invariant factor loadings indicate that the constructs influence changes in item scores in the same way in different groups. The third type of invariance tested in this study is (3) scalar invariance. An instrument is scalar invariant when the intercepts of each item are the same across different groups. Once scalar invariance is established, it is possible to compare the mean scores of the scales between different groups [41, 42]. Lastly, the (4) residual invariance properties were studied. Residual invariance means that the residuals of the items are equal across different groups. This provides additional confidence that observed differences in mean scale scores between groups are driven by differences in the latent construct of interest and not by other, unmeasured factors [41, 42].
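Because each step is only meaningful once the previous one holds, the procedure forms a simple ladder: testing stops at the first level that fails. A minimal sketch of this logic (names and data layout are our own, not from the study):

```python
# Sketch of the sequential testing logic: each step adds equality constraints
# on top of the previous model, and testing stops at the first step that fails.
INVARIANCE_STEPS = [
    ("configural", "same factor structure in every group"),
    ("metric",     "equal factor loadings across groups"),
    ("scalar",     "equal item intercepts across groups"),
    ("residual",   "equal item residual variances across groups"),
]

def highest_invariance(step_holds):
    """step_holds maps step name -> bool; return the most stringent step that held."""
    achieved = None
    for name, _description in INVARIANCE_STEPS:
        if not step_holds.get(name, False):
            break
        achieved = name
    return achieved

# e.g. the pattern found for the WeRFree health condition subsample:
print(highest_invariance({"configural": True, "metric": True,
                          "scalar": True, "residual": False}))  # scalar
```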

In the current analysis, mean factor scores are presented for both the WeRFree instrument and the ICECAP-A. Furthermore, for the WeRFree instrument, adjusted scale scores are presented. Because the items have varying numbers of response categories, scale scores were normalized: each item score was divided by the maximum possible score of that item (e.g. an item with a score from 0 to 3 was divided by 3), multiplied by 100, and the resulting item scores were averaged across the items of a scale, so that each item contributes equally to the overall scale score. ICECAP-A scores are also presented, with raw index values adjusted according to the United Kingdom tariff developed by Flynn, Huynh [36]. This score ranges from zero to one, with zero reflecting a state of no capability and one a state of full capability [36].
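As an illustration of this normalization (the function and the item values are our own hypothetical sketch, not the study’s code or data):

```python
# Hypothetical sketch of the WeRFree scale-score normalization: rescale each
# item to 0-100 by dividing by its maximum possible score, then average the
# rescaled items so that each contributes equally to the scale score.

def normalized_scale_score(item_scores, item_maxima):
    rescaled = [100 * score / maximum
                for score, maximum in zip(item_scores, item_maxima)]
    return sum(rescaled) / len(rescaled)

# Two items scored 0-3 and one item scored 0-10 (hypothetical responses):
print(normalized_scale_score([3, 0, 5], [3, 3, 10]))  # 50.0
```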

Various fit indices were used to establish measurement invariance. Configural fit was established with the following criteria: CFI higher than 0.900, RMSEA lower than 0.08, and SRMR lower than 0.08 [39,40,41]. To study the other forms of measurement invariance, we followed the fit index values suggested by Chen [41] for group sizes of 300 or more, because the groups in the different subsamples are all larger than 300. For further measurement invariance testing, the ΔCFI, ΔRMSEA, and ΔSRMR fit indices were used. A decrease in CFI of 0.010 or more, an increase in RMSEA of 0.015 or more, or an increase in SRMR of 0.030 or more indicated noninvariance at the metric level. A decrease in CFI of 0.010 or more, an increase in RMSEA of 0.015 or more, or an increase in SRMR of 0.010 or more indicated noninvariance at the scalar and residual levels. The chi-square difference test was not used to assess and compare model fit because of the large sample sizes of the subsamples, which would result in trivial differences in model fit being flagged as significant [41].
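These cut-offs amount to a small decision rule. The sketch below is our own illustrative Python (the study used R/lavaan); deltas are computed as the more constrained model minus the previous model:

```python
# Illustrative sketch of the Chen (2007) change-in-fit cut-offs as applied in
# this study. d_cfi, d_rmsea, d_srmr = (constrained model) - (previous model).

def noninvariant(d_cfi, d_rmsea, d_srmr, step):
    """Return True when the loss of fit exceeds the cut-offs for the given step."""
    # The SRMR cut-off is looser at the metric step (0.030) than at the
    # scalar and residual steps (0.010); CFI and RMSEA cut-offs are shared.
    srmr_cut = 0.030 if step == "metric" else 0.010
    return d_cfi <= -0.010 or d_rmsea >= 0.015 or d_srmr >= srmr_cut

# Residual step of the ICECAP-A in the health condition subsample
# (values reported in the results section):
print(noninvariant(d_cfi=-0.026, d_rmsea=0.018, d_srmr=0.024, step="residual"))  # True
```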

For the analysis presented in this manuscript, the lavaan package in R was used [45]. Because some response options of some items received close to no responses, polychoric correlations were not used: in such cases correlations can be estimated incorrectly, which affects the estimation of the parameters of CFA models [46]. Instead, Pearson correlations were used for model estimation, given that the sample sizes in each group were reasonably large (the smallest group had more than 500 observations, see Table 1) and that the number of response options per item was generally larger than five. Under such conditions, authors have argued that data can be treated as continuous [47, 48]. For the same reasons, the models were estimated with a maximum likelihood estimator [47, 49]. Missing data were handled through full information maximum likelihood estimation of the models [50].



Table 1 presents the sample size per subsample, as well as the size of the different groups within those subsamples. It should be noted that the total size of the health condition subsample is slightly smaller than that of the other subsamples. This is a consequence of the deletion of two “artifact” disease groups: during the recruitment phase of the MIC study project, the Australian arm also recruited patients affected by stroke and chronic obstructive pulmonary disease. These subgroups consisted of 23 and 66 participants respectively. The sample sizes of these groups were considered inadequate for further analysis, and the observations were not included for measurement invariance testing in the health condition subsample. Furthermore, 15 observations in the MIC database showed missing data for the items included in the “Reflective Wellbeing” scale of the WeRFree instrument.

Table 1 Sample size per group

WeRFree instrument

As mentioned in the methods section, the WeRFree instrument has shown an adequate fit with the MIC data (χ2: 1,756.8, df: 87, CFI: 0.970, TLI: 0.963, RMSEA: 0.055, SRMR: 0.036, see Ubels, Hernandez-Villafuerte [33]). The three scales of the WeRFree instrument also showed adequate reliability (Perceived Access to Options: Cronbach’s alpha of 0.89, Affective Wellbeing: Cronbach’s alpha of 0.83, Reflective Wellbeing: Cronbach’s alpha of 0.89, see Ubels, Hernandez-Villafuerte [33]). The results of the measurement invariance tests are presented in Table 2. Configural invariance was established in every subsample: the highest value for the upper bound of the RMSEA 90% confidence interval was 0.060, reached in the health condition and age group subsamples; the highest SRMR value was 0.041, in the health condition subsample; and the lowest CFI value was 0.961, in the health condition subsample. Metric invariance was also established in every subsample. The largest reduction in model fit in terms of CFI and SRMR was identified in the health condition subsample, with reductions of 0.003 and 0.008 respectively. Scalar invariance was also established in every subsample. The largest reductions in RMSEA and SRMR, 0.004 and 0.004 respectively, were identified in the age group subsample; furthermore, a 0.010 (rounded up) reduction in CFI was identified in the health condition subsample. Residual invariance was not established in the age group and health condition subsamples. To conclude, the WeRFree instrument was measurement invariant up to scalar invariance in the health condition and age group subsamples, and full measurement invariance was established in the gender and education subsamples. Table 3 presents the mean scale scores with the associated standard deviations, as well as the standardized factor means per subsample for the constructs of the WeRFree instrument.

Table 2 Measurement invariance of the WeRFree instrument per subsample
Table 3 Mean scale scores, associated standard deviations and standardized mean factor scores per subgroup per sample for the WeRFree instrument


The initial model, in which all items of the ICECAP-A loaded on one factor, showed inadequate fit in terms of the RMSEA (CFI = 0.961, TLI = 0.922, RMSEA = 0.129, SRMR = 0.033). Upon inspecting the modification indices, we found that two pairs of items showed local dependencies: the items related to attachment and enjoyment (expected improvement in χ2 of 329, expected standardized correlation of 0.281), and the items related to autonomy and achievement (expected improvement in χ2 of 320, expected standardized correlation of 0.318). Due to the small difference between these expected improvements, and because the next two largest sources of misfit were also associated with the attachment item (expected improvements in χ2 of 179 when correlated with the achievement item and 145 when correlated with the autonomy item), we decided to first correlate the error terms of the attachment and enjoyment items. Still, the RMSEA indicated inadequate fit (CFI = 0.982, TLI = 0.955, RMSEA = 0.099, SRMR = 0.033). Therefore, we also correlated the error terms of the autonomy and achievement items, which resulted in an adequate fit (CFI = 0.995, TLI = 0.982, RMSEA = 0.062, SRMR = 0.013). This resulted in the measurement model presented in Fig. 1, which also shows standardized values for various parameters. Cronbach’s alpha for the ICECAP-A with the complete sample of the MIC study database was 0.85.

Fig. 1 Measurement model of the ICECAP-A with standardized parameter values

Table 4 presents the results of the measurement invariance tests of the ICECAP-A instrument. Configural invariance was established in every subsample: the highest value for the upper bound of the RMSEA 90% confidence interval was 0.078, reached in the age group subsample; the highest SRMR value was 0.013, also in the age group subsample. The CFI values were generally very high, around 0.995 in every subsample. Metric invariance of the ICECAP-A was also established in every subsample. In terms of RMSEA, the model fit improved in every subsample. A particularly large negative change in terms of SRMR was identified in the health condition subsample, with a change of 0.017. Scalar invariance was also established in every subsample, although it was borderline in the age group subsample in terms of RMSEA (ΔRMSEA = 0.010 rounded). The CFI values of the age group and health condition subsamples changed by 0.009 and 0.008 respectively. Residual invariance could not be established for the health condition subsample (ΔCFI = -0.026, ΔRMSEA = 0.018, ΔSRMR = 0.024); the other subsamples were residual invariant. This means that for the ICECAP-A, full measurement invariance was established in the age group, gender, and education subsamples. Table 5 presents the adjusted scores of the ICECAP-A with associated standard deviations, as well as mean factor scores.

Table 4 Measurement invariance of the ICECAP-A per subsample
Table 5 ICECAP-A scores and associated standard deviation per group


In this study, the measurement invariance properties of the WeRFree and ICECAP-A instruments were tested. Before testing the measurement invariance properties of the ICECAP-A, it was necessary to adjust its measurement model by correlating two error terms, because the one-factor model without correlated error terms showed insufficient fit. Given that these adjustments were data-driven, only post-hoc explanations can be provided for why these items might correlate. In the case of the attachment and enjoyment items, there could be an additional correlation between these items due to the strong relationship between social relations and happiness. The errors of these items were also correlated in a previous study by Rencz, Mitev [26]. Such a relationship might also exist for the achievement and autonomy items, since experiences of independence and progress could be closely related to each other and might exhibit correlations that are not explained by the overall latent variable of capability wellbeing. These relationships could be an interesting subject for future confirmatory studies.

In the current study, the instruments were shown to have configural, metric, and scalar invariant properties in the tested subsamples. The establishment of scalar invariance in every subsample indicates that the instruments’ mean scores can be compared on a group level. By comparing the responses of individuals who are relatively disadvantaged in terms of their capabilities (e.g. due to disease) with a reference group (e.g. healthy individuals), it is possible to establish whether responses are affected by adapted preferences. Such reference groups have also been used, albeit not routinely, to test for response shift in patient responses [18, 19].

In the context of testing for the measurement invariance properties of capability instruments in populations that differ in terms of their health condition, the identification of a reference group might be a challenge. Such a reference group should have a set of capabilities that ensures that adapted preferences do not affect the responses of this reference group. However, what such a set entails or how such a list should be constructed is not clear [51], which complicates the identification of a reference group. In this context, more research is necessary. For the time being, it might be sufficient to use a sample from the general population that is reasonably healthy to test for adapted preferences in individuals with health problems.

As was mentioned in the introduction, testing for measurement invariance could indicate how adapted preferences affect responses to instruments. In this context, it is important to note that establishing measurement invariance between advantaged and disadvantaged groups is not evidence against the existence of adapted preferences. As noted, measurement invariance testing merely tests whether response patterns of items differ between groups. Systematic between-group differences in how individuals respond to instruments, such as overall response styles, are hard to detect with such tests [52]. Furthermore, if measurement invariance cannot be established, the source of measurement noninvariance need not be adapted preferences, since there can be several alternative explanations for why individuals interpret items differently. Lastly, it should be noted that, depending on the research aim, different levels of measurement invariance might be sufficient. For example, for studying the correlations between constructs, it is sufficient to establish configural invariance. If a study aims to measure a change in a construct of interest, which is often the case in health economics, it is sufficient to establish metric invariance.

When measurement invariance cannot be established, further studies can be conducted to identify the source of measurement noninvariance [44, 53]. It should, however, be noted that establishing noninvariance does not mean that groups cannot be meaningfully compared. Indeed, it can be the case that the noninvariance of items is symmetrically distributed, which means that the noninvariance of multiple items has little effect on the scale score [52]. As such, it is important to study the pattern of noninvariance [54]. Studying these patterns could also yield interesting insights into how items are interpreted and responded to [54], which could further result in deeper insights into how people experience their capability well-being.


The recruitment strategy of the MIC study aimed at recruiting a sufficient number of participants from different health backgrounds who gave reliable responses [38]. As such, the database was not necessarily designed to reflect specific (sub)populations. Therefore, the measurement invariance test results, as well as the comparison of scale scores and factor means, should not directly be generalized. A further limitation is that in the current analysis the overall sample is divided into different subsamples based on variables that are probably not independent of each other. This affects the interpretation of noninvariance test results, since it is unclear what the exact source of residual noninvariance is. For example, in the case of the WeRFree instrument, residual invariance could not be established in the health condition and age group subsamples. In this case, it is unclear whether age, the health condition, or an interaction between age and health condition explains why this noninvariance exists. Given that the MIC study sample was not meant to reflect specific populations, we decided not to test in detail what the source of noninvariance was, since the result of such a test would have only limited generalizability.

Another limitation concerns the use of the MIC study database both to develop the WeRFree instrument and to test its measurement invariance properties. Due to the use of the same database for both purposes, measurement errors attributable to the design of the MIC survey may be unaccounted for. As a consequence, the measurement models might overfit, which in the context of the present study means that the measurement invariance properties of the WeRFree instrument may be overestimated.


To conclude, this study shows how measurement invariance testing can be used to research whether adapted preferences influence instrument responses. The study shows that the WeRFree and ICECAP-A instruments are at least scalar invariant in various subpopulations of the MIC study. This indicates that aggregated responses can be compared across different groups. However, due to the limitations of this study, this result needs to be confirmed in other samples. In the context of capability instrument development, future studies should focus on establishing the measurement invariance properties of these instruments. This would clarify whether information from self-report capability instruments is comparable across groups that differ in terms of their relative advantage.

Data Availability

The data that support the findings of this study are not publicly available. Data can, however, be requested by contacting the researchers responsible for the Multi Instrument Comparison study.


  1. Doward LC, McKenna SP. Defining patient-reported outcomes. Value in Health. 2004;7:4–S8.

    Article  Google Scholar 

  2. Huang I-C, Leite WL, Shearer P, Seid M, Revicki DA, Shenkman EA. Differential item functioning in quality of life measure between children with and without special health-care needs. Value in Health. 2011;14(6):872–83.

  3. Smith AB, Cocks K, Parry D, Taylor M. A differential item functioning analysis of the EQ-5D in cancer. Value in Health. 2016;19(8):1063–7.

  4. Knott RJ, Lorgelly PK, Black N, Hollingsworth B. Differential item functioning in quality of life measurement: an analysis using anchoring vignettes. Soc Sci Med. 2017;190:247–55.

  5. Robeyns I. Clarifications. In: Wellbeing, Freedom and Social Justice: The Capability Approach Re-Examined. Open Book Publishers; 2017. pp. 89–168.

  6. Mitchell P. Adaptive preferences, adapted preferences. Mind. 2018;127(508):1003–25.

  7. Ubel PA, Loewenstein G, Schwarz N, Smith D. Misimagining the unimaginable: the disability paradox and health care decision making. Health Psychol. 2005;24(4S):57.

  8. Ilie G, Bradfield J, Moodie L, Lawen T, Ilie A, Lawen Z, et al. The role of response-shift in studies assessing quality of life outcomes among cancer patients: a systematic review. Front Oncol. 2019;9:783.

  9. Penton H, Dayson C, Hulme C, Young T. An investigation of age-related differential item functioning in the EQ-5D-5L using item response theory and logistic regression. Value in Health. 2022;25(9):1566–74.

  10. Knott RJ, Black N, Hollingsworth B, Lorgelly PK. Response-scale heterogeneity in the EQ‐5D. Health Econ. 2017;26(3):387–94.

  11. Groot W. Adaptation and scale of reference bias in self-assessments of quality of life. J Health Econ. 2000;19(3):403–20.

  12. Mitchell PM, Roberts TE, Barton PM, Coast J. Assessing sufficient capability: a new approach to economic evaluation. Soc Sci Med. 2015;139:71–9.

  13. Millsap RE. Invariance in measurement and prediction revisited. Psychometrika. 2007;72(4):461–73.

  14. Jang S, Kim ES, Cao C, Allen TD, Cooper CL, Lapierre LM, et al. Measurement invariance of the satisfaction with life scale across 26 countries. J Cross-Cult Psychol. 2017;48(4):560–76.

  15. Jeong S, Lee Y. Consequences of not conducting measurement invariance tests in cross-cultural studies: a review of current research practices and recommendations. Adv Developing Hum Resour. 2019;21(4):466–83.

  16. Odell B, Gierl M, Cutumisu M. Testing measurement invariance of PISA 2015 mathematics, science, and ICT scales using the alignment method. Stud Educational Evaluation. 2021;68:100965.

  17. Dong Y, Dumas D. Are personality measures valid for different populations? A systematic review of measurement invariance across cultures, gender, and age. Pers Indiv Differ. 2020;160:109956.

  18. Sajobi TT, Brahmbatt R, Lix LM, Zumbo BD, Sawatzky R. Scoping review of response shift methods: current reporting practices and recommendations. Qual Life Res. 2018;27(5):1133–46.

  19. Schwartz CE, Bode R, Repucci N, Becker J, Sprangers MA, Fayers PM. The clinical significance of adaptation to changing health: a meta-analysis of response shift. Qual Life Res. 2006;15(9):1533–50.

  20. Sen A. Well-being, agency and freedom: the Dewey lectures 1984. J Philos. 1985;82(4):169–221.

  21. Mitchell PM, Roberts TE, Barton PM, Coast J. Applications of the capability approach in the health field: a literature review. Soc Indic Res. 2017;133(1):345–71.

  22. Ubels J, Hernandez-Villafuerte K, Schlander M. The value of freedom: a review of the current developments and conceptual issues in the measurement of capability. J Hum Dev Capabilities. 2022:1–27.

  23. Helter TM, Coast J, Łaszewska A, Stamm T, Simon J. Capability instruments in economic evaluations of health-related interventions: a comparative review of the literature. Qual Life Res. 2020;29(6):1433–64.

  24. Till M, Abu-Omar K, Ferschl S, Reimers AK, Gelius P. Measuring capabilities in health and physical activity promotion: a systematic review. BMC Public Health. 2021;21(1):1–23.

  25. Afentou N, Kinghorn P. A systematic review of the feasibility and psychometric properties of the ICEpop CAPability measure for adults and its use so far in economic evaluation. Value in Health. 2020;23(4):515–26.

  26. Rencz F, Mitev AZ, Jenei B, Brodszky V. Measurement properties of the ICECAP-A capability well-being instrument among dermatological patients. Qual Life Res. 2021:1–13.

  27. Coast J, Bailey C, Orlando R, Armour K, Perry R, Jones L, et al. Adaptation, acceptance and adaptive preferences in health and capability well-being measurement amongst those approaching end of life. The Patient-Patient-Centered Outcomes Research. 2018;11(5):539–46.

  28. King-Kallimanis BL, Ter Hoeven CL, de Haes HC, Smets EM, Koning CC, Oort FJ. Assessing measurement invariance of a health-related quality-of-life questionnaire in radiotherapy patients. Qual Life Res. 2012;21:1745–53.

  29. van Roij J, Kieffer JM, van de Poll-Franse L, Husson O, Raijmakers NJ, Gelissen J. Assessing measurement invariance in the EORTC QLQ-C30. Qual Life Res. 2022:1–13.

  30. Dabakuyo T, Guillemin F, Conroy T, Velten M, Jolly D, Mercier M, et al. Response shift effects on measuring post-operative quality of life among breast cancer patients: a multicenter cohort study. Qual Life Res. 2013;22:1–11.

  31. Perkins AJ, Stump TE, Monahan PO, McHorney CA. Assessment of differential item functioning for demographic comparisons in the MOS SF-36 health survey. Qual Life Res. 2006;15:331–48.

  32. Tessier P, Blanchin M, Sébille V. Does the relationship between health-related quality of life and subjective well-being change over time? An exploratory study among breast cancer patients. Soc Sci Med. 2017;174:96–103.

  33. Ubels J, Hernandez-Villafuerte K, Schlander M. The value of freedom: the development of the WeRFree capability instrument. medRxiv. 2022. 2022.10.05.22280720.

  34. Ubels J, Hernandez-Villafuerte K, Niebauer E, Schlander M. The value of freedom: extending the evaluative space of capability. medRxiv. 2022.

  35. Al-Janabi H, Flynn TN, Coast J. Development of a self-report measure of capability wellbeing for adults: the ICECAP-A. Qual Life Res. 2012;21(1):167–76.

  36. Flynn TN, Huynh E, Peters TJ, Al-Janabi H, Clemens S, Moody A, et al. Scoring the ICECAP‐A capability instrument. Estimation of a UK general population tariff. Health Econ. 2015;24(3):258–69.

  37. Richardson J, Khan M, Iezzi A, Maxwell A. Cross-national comparison of twelve quality of life instruments. Res Papers. 2012;78:80–3. MIC report.

  38. The Multi Instrument Comparison (MIC) project. Accessed 7 Nov 2023.

  39. Bentler PM, Bonett DG. Significance tests and goodness of fit in the analysis of covariance structures. Psychol Bull. 1980;88(3):588.

  40. Hu LT, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Model. 1999;6(1):1–55.

  41. Chen FF. Sensitivity of goodness of fit indexes to lack of measurement invariance. Struct Equ Model. 2007;14(3):464–504.

  42. Widaman KF, Reise SP. Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. 1997.

  43. Meredith W, Teresi JA. An essay on measurement and factorial invariance. Med Care. 2006:S69–S77.

  44. Putnick DL, Bornstein MH. Measurement invariance conventions and reporting: the state of the art and future directions for psychological research. Dev Rev. 2016;41:71–90.

  45. Rosseel Y. Lavaan: an R package for structural equation modeling. J Stat Softw. 2012;48:1–36.

  46. Flora DB, Curran PJ. An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychol Methods. 2004;9(4):466.

  47. Rhemtulla M, Brosseau-Liard PÉ, Savalei V. When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychol Methods. 2012;17(3):354.

  48. Liu Y, Millsap RE, West SG, Tein J-Y, Tanaka R, Grimm KJ. Testing measurement invariance in longitudinal data with ordered-categorical measures. Psychol Methods. 2017;22(3):486.

  49. Bandalos DL. Relative performance of categorical diagonally weighted least squares and robust maximum likelihood estimation. Struct Equ Model. 2014;21(1):102–16.

  50. Enders CK, Bandalos DL. The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Struct Equ Model. 2001;8(3):430–57.

  51. Robeyns I. Sen’s capability approach and gender inequality: selecting relevant capabilities. Fem Econ. 2003;9(2–3):61–92.

  52. Robitzsch A, Lüdtke O. Why full, partial, or approximate measurement invariance are not a prerequisite for meaningful and valid group comparisons. Struct Equ Model. 2023;30(6):859–70.

  53. Jung E, Yoon M. Comparisons of three empirical methods for partial factorial invariance: forward, backward, and factor-ratio tests. Struct Equ Model. 2016;23(4):567–84.

  54. Fischer R, Karl J, Luczak-Roesch M. Why equivalence and invariance are both different and essential for scientific studies of culture: A discussion of mapping processes and theoretical implications. 2022.



Open Access funding enabled and organized by Projekt DEAL. The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information

Authors and Affiliations



Jasper Ubels conceptualized and designed the study, analysed and interpreted the data, conducted the statistical analysis, and drafted the manuscript. Michael Schlander critically revised the manuscript and supervised the project.

Corresponding author

Correspondence to Jasper Ubels.

Ethics declarations

Ethics approval

The current study was conducted according to the 1964 Declaration of Helsinki and its later amendments. The original study received ethical approval from the Monash University Human Research Ethics Committee. For the use of this database for the current study, a positive ethical evaluation was obtained from the ethics committee of the Medical Faculty of the University of Heidelberg. Informed consent was obtained from all individual participants to participate in the MIC study. The participants were informed that their data could be used for future studies in anonymized form.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ubels, J., Schlander, M. Measurement invariance and adapted preferences: evidence for the ICECAP-A and WeRFree instruments. Health Qual Life Outcomes 21, 121 (2023).
