To present our argument in the most straightforward fashion, we have used the psychometric terminology of classical test theory (e.g., that the observed score is a combination of true score and error). Although this basic psychometric model is the most familiar, several other psychometric theories have been applied to QOL assessment, and some have gained popularity in recent years.
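For reference, the classical model can be written as a simple decomposition (a textbook formulation, not specific to any particular QOL instrument), with reliability defined as the proportion of observed-score variance attributable to the true score:

```latex
X = T + E, \qquad \operatorname{Var}(X) = \sigma^2_T + \sigma^2_E, \qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```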
Cronbach et al.'s elegant extension of classical test theory, generalizability theory, assumes that observations are randomly sampled from a universe of possible observations along different facets of measurement. Generalizability theory deals primarily with issues of reliability. Items, observers, and occasions are all treated as random effects to identify sources of variation in measurement. However, the notion of the "universe score" in generalizability theory is similar to the true score concept in classical test theory. Observations may vary, but they are all estimates of a single true score. There is no analogue to appraisal in generalizability theory, although differences in appraisal might be expressed in terms of person-by-occasion interactions with each facet of assessment. For example, estimates of inter-item variation (internal consistency) might differ from person to person because relationships among items might depend upon differences in appraisal processes. Generalizability theory methods would be able to detect certain ramifications of differences in appraisal, but additional appraisal constructs would be required to explain these differences.
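As an illustration, for a fully crossed person x item x occasion design (a standard G-theory formulation with n_i items and n_o occasions; the design itself is hypothetical), the observed-score variance decomposes into facet variance components, and the generalizability coefficient counts only the person component as universe-score variance:

```latex
\sigma^2(X_{pio}) = \sigma^2_p + \sigma^2_i + \sigma^2_o + \sigma^2_{pi}
  + \sigma^2_{po} + \sigma^2_{io} + \sigma^2_{pio,e}, \qquad
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \frac{\sigma^2_{pi}}{n_i}
  + \frac{\sigma^2_{po}}{n_o} + \frac{\sigma^2_{pio,e}}{n_i n_o}}
```

A person-by-item component that is large relative to the others is one place where appraisal-driven differences in how individuals relate to particular items would surface, although, as noted above, explaining that component would require appraisal constructs outside the G-theory framework.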
An integration of generalizability theory and the appraisal paradigm might be accomplished with techniques for random-effects models with nested data, such as hierarchical linear modeling (HLM). As in generalizability theory, items would be treated as random effects, nested within individual respondents. Individual differences in appraisal parameters (level-2 independent variables) could be used to account for differences in variance among QOL items, as well as for correlations of QOL item ratings with item characteristics discussed by Bjorner, Ware and Kosinski, such as positive or negative valence, specificity, or type of rating scale (level-1 independent variables). The two-level HLM could be further generalized to a three-level model to account for items nested within occasions within persons. The three-level model could incorporate changes in appraisal over occasions to determine whether response shifts affect relationships among items.
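A minimal sketch of the two-level set-up, assuming a long-format dataset with hypothetical column names (person_id, qol_rating, valence, appraisal_frame) and using the linear mixed-effects routine in statsmodels as a stand-in for dedicated HLM software:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per person x item rating.
# Columns (all assumed for illustration): person_id, qol_rating,
# valence (a level-1 item characteristic), appraisal_frame (a level-2
# person characteristic, e.g., standard of comparison used).
df = pd.read_csv("qol_items_long.csv")

# Level-1 item characteristics and level-2 appraisal parameters enter as
# fixed effects; the cross-level interaction tests whether the effect of
# the item characteristic depends on individual appraisal. The random
# intercept and valence slope are allowed to vary across persons.
model = smf.mixedlm(
    "qol_rating ~ valence * appraisal_frame",
    data=df,
    groups="person_id",
    re_formula="~valence",
)
result = model.fit()
print(result.summary())
```

The three-level extension (items within occasions within persons) would add an occasion index and occasion-varying appraisal measures; dedicated multilevel software may handle that additional level of nesting more gracefully than this sketch.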
Structural equation modeling (SEM) also provides several useful ways to incorporate appraisal parameters in studies of the psychometric properties of QOL measures. SEM estimates of true scores on "latent variables" are based on the convergence of observed variables. One interesting approach to examining the internal consistency or factor structure of items on a QOL measure would be to disaggregate a sample according to appraisal parameters of interest and compare relationships among QOL items using confirmatory factor analysis. For example, appraisal assessment could identify one-year post-treatment cancer survivors who use different standards of comparison to judge QOL. Diverse QOL items could load on a single factor ("I now have less pain, more energy, am less worried, have gone back to work, and my mood is better") or yield a more complex factor structure ("I get tired more easily and I have had to slow down at work, but I don't have pain anymore and I don't worry about the small stuff"). This sort of difference in factor structure as a function of appraisal has implications for the use of QOL measures. As factorial complexity increases, the sensitivity of a full-scale score that combines all the items decreases. This is especially problematic in intervention studies where the goal of treatment is to foster rehabilitation and reentry into normative roles and relationships. Selection of QOL outcome measures for such studies might be based on analyses demonstrating factorial invariance across differences in key appraisal parameters. Scale construction could then be optimized to take into account the impact of anticipated group differences or individual changes in appraisal.
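A rough sketch of the disaggregation strategy, using exploratory factor analysis from scikit-learn as a simple stand-in for the confirmatory multi-group comparison described above; the item set, column names, and appraisal grouping variable are all hypothetical:

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Hypothetical wide-format data: one row per one-year post-treatment
# survivor, with QOL item columns and a classification of the standard
# of comparison each respondent uses (appraisal_frame).
df = pd.read_csv("qol_survivors.csv")
items = ["pain", "energy", "worry", "work_role", "mood"]

for frame, group in df.groupby("appraisal_frame"):
    # Fit one- and two-factor solutions within each appraisal group and
    # compare loadings and fit; a single strong factor in one group but
    # a more complex structure in another suggests appraisal-dependent
    # factor structure of the kind described in the text.
    fa1 = FactorAnalysis(n_components=1).fit(group[items])
    fa2 = FactorAnalysis(n_components=2).fit(group[items])
    print(frame)
    print("  1-factor loadings:", fa1.components_.round(2))
    print("  2-factor loadings:", fa2.components_.round(2))
    print(f"  avg log-likelihood: 1-factor={fa1.score(group[items]):.3f}, "
          f"2-factor={fa2.score(group[items]):.3f}")
```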
An analogous SEM approach could also be used to examine differences in the structural relationships between QOL measures and various antecedents and catalysts (see the discussion of "Appraisal and Response Shift in the Regression Paradigm" in our companion piece). We might predict that the correlations among indicators of functional impairment and overall well-being are significantly greater among those most concerned about maintaining highly active roles at work or in the community. SEM could be used to compare these relationships among groups identified as having relevant differences in their frames of reference, as a test of the external construct validity of the QOL measures.
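Short of a full multi-group SEM, this prediction could be screened with a Fisher r-to-z comparison of the impairment/well-being correlation across appraisal-defined groups; the groups, correlations, and sample sizes below are invented purely for illustration:

```python
import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Invented example: the impairment x well-being correlation among respondents
# most concerned with maintaining highly active roles vs. the remainder.
z, p = compare_correlations(r1=0.62, n1=140, r2=0.35, n2=180)
print(f"z = {z:.2f}, p = {p:.3f}")
```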
Over the past decade, item response theory (IRT) has been applied to the psychometric evaluation of QOL measures. IRT characterizes items in terms of changes in the probability of responses along a latent dimension, "theta", which is analogous to the underlying "true score" in classical test theory. Items vary in their ability to discriminate between people with higher and lower values of theta. One advantage of IRT is the ability to select sets of items that discriminate along a continuum of levels of difficulty. For example, it is easier to say yes to "I am uncomfortable" than to "I am in agony".
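The point about item difficulty can be made concrete with the standard two-parameter logistic (2PL) item response function; the discrimination and difficulty values here are arbitrary, chosen only so that endorsing "I am in agony" requires a much higher level of the latent trait than endorsing "I am uncomfortable":

```python
import numpy as np

def p_endorse(theta, a, b):
    """2PL model: probability of endorsing an item, given latent trait theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
# Arbitrary illustrative parameters: "uncomfortable" is easy to endorse,
# while "agony" is endorsed only at high levels of the latent dimension.
uncomfortable = p_endorse(theta, a=1.5, b=-1.0)
agony = p_endorse(theta, a=1.5, b=2.0)
for t, p1, p2 in zip(theta, uncomfortable, agony):
    print(f"theta={t:+.1f}  P(uncomfortable)={p1:.2f}  P(agony)={p2:.2f}")
```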
IRT depends on the ability to identify coherent monotonic relationships between item responses and underlying latent dimensions. IRT cannot work if responses to items are ordered using different and unidentified underlying criteria. As such, IRT leads researchers to exclude items that do not clearly discriminate people at different levels of an underlying theta.
If we view variability in appraisal as an intrinsically meaningful aspect of QOL as opposed to a source of measurement error, the exclusion of items based on lack of fit with the IRT paradigm may be problematic. Limiting QOL items to those that can be precisely and consistently ordered may unduly constrain variability in QOL appraisal. Indeed, Bjorner, Ware and Kosinski suggest that cognitive assessment could be used as an adjunct to IRT, to further assess individuals whose responses do not fit the parameters established for item difficulty in an IRT model. This could lead to expansion of item content or revised assumptions about the dimensionality of the QOL construct under investigation. We would further suggest that the specific inclusion of QOL items shown to be sensitive to differences in appraisal parameters may be quite desirable.
Assessment of appraisal parameters may complement IRT in the development of computerized adaptive testing (CAT) systems to estimate an individual's level of QOL. By incorporating measures of appraisal parameters, adaptive testing could be guided by individual information on the meaning of QOL and criteria for self-appraisal, as well as by estimated thresholds on different QOL items based solely on item characteristics in the aggregate. Appraisal assessment may lead CAT systems to focus on different areas of QOL for some people and to alter the ways that items are ordered and combined. Sufficient reason exists on theoretical, clinical, and empirical grounds to argue that individuals can differ widely in their interpretation and use of QOL items. Adaptive testing that estimates all parameters of the contingent true score would provide information on patients that is sensitive to the full range of variation and diversity in QOL.
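A bare-bones sketch of how appraisal information might steer adaptive item selection: the usual maximum-information rule under a 2PL model, with a hypothetical filter restricting candidate items to content areas that the respondent's appraisal profile marks as relevant. The item bank, parameters, content tags, and filter are all invented for illustration.

```python
import numpy as np

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at a given theta."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

# Hypothetical item bank: (discrimination, difficulty, content area).
bank = [
    (1.6, -1.0, "pain"), (1.2, 0.5, "pain"),
    (1.4, -0.5, "role"), (1.8, 1.0, "role"),
    (1.1, 0.0, "mood"), (1.5, 1.5, "mood"),
]

def select_next_item(theta_hat, administered, relevant_areas):
    """Pick the unadministered item with maximum information at theta_hat,
    restricted (hypothetically) to content areas flagged by appraisal."""
    candidates = [
        (i, a, b) for i, (a, b, area) in enumerate(bank)
        if i not in administered and area in relevant_areas
    ]
    if not candidates:  # fall back to the full bank if the filter empties it
        candidates = [(i, a, b) for i, (a, b, _) in enumerate(bank)
                      if i not in administered]
    return max(candidates, key=lambda c: item_information(theta_hat, c[1], c[2]))[0]

# Example: an appraisal assessment indicating that work roles and mood,
# rather than pain, define this respondent's frame of reference for QOL.
next_item = select_next_item(theta_hat=0.2, administered={2},
                             relevant_areas={"role", "mood"})
print("next item index:", next_item)
```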