Comparing five depression measures in depressed Chinese patients using item response theory: an examination of item properties, measurement precision and score comparability
© The Author(s). 2017
Received: 18 November 2016
Accepted: 15 March 2017
Published: 4 April 2017
Item response theory (IRT) has been increasingly applied to patient-reported outcome (PRO) measures. This study applies IRT to examine item properties (discrimination and severity of depressive symptoms), measurement precision and score comparability across five depression measures, and is the first study of its kind in the Chinese context.
A clinical sample of 207 Hong Kong Chinese outpatients was recruited. Data analyses included classical item analysis, IRT concurrent calibration and IRT true score equating. The IRT assumptions of unidimensionality and local independence were tested using confirmatory factor analysis and chi-square statistics, respectively. The IRT linking assumptions of construct similarity, equity and subgroup invariance were also tested. The graded response model was applied to concurrently calibrate all five depression measures in a single IRT run, placing the item parameter estimates of these measures onto a single common metric. IRT true score equating was implemented to perform the outcome score linking and construct score concordances, so that scores from one measure could be linked to corresponding scores on another measure for direct comparability.
Findings suggested that (a) symptoms of depressed mood, suicidality and feelings of worthlessness served as the strongest discriminating indicators, and symptoms concerning suicidality, changes in appetite, depressed mood, feelings of worthlessness and psychomotor agitation or retardation reflected high levels of severity in the clinical sample; (b) the five depression measures contributed varying degrees of measurement precision at different levels of depression; and (c) after outcome score linking was performed across the five measures, the cut-off scores led to either consistent or discrepant diagnoses for depression.
The study provides additional evidence regarding the psychometric properties and clinical utility of the five depression measures, offers methodological contributions to the appropriate use of IRT in PRO measures, and helps elucidate cultural variation in depressive symptomatology. The approach of concurrently calibrating and linking multiple PRO measures can also be applied to the assessment of PROs beyond the depression context.
Keywords: Item response theory; Outcome score linking; Depressive symptomatology; Measurement precision; Score concordances; Patient-reported outcome measures
With a growing emphasis on patient-centered care, the recent surge in the use of high-quality data from psychometrically sound patient-reported outcome (PRO) measures has engendered the opportunity to use PROs to inform healthcare practices and guide healthcare decision making. In a commissioned paper by the U.S. National Quality Forum on the issues to consider when evaluating PROs as candidate performance measures in healthcare settings, Cella et al.  remarked on several methodological issues related to the use of PROs in patient-centered outcome research. One issue focuses on establishing standardized metrics and deriving comparable scores across different PRO measures of the same construct to facilitate direct comparisons between PROs. In addition, the authors highlighted a number of PRO characteristics to consider when selecting appropriate PROs. Measurement precision was among the most important characteristics, as PRO measures with greater measurement precision appear to show greater sensitivity to change . PRO measures not only have great potential to be integrated into healthcare practice but also substantially contribute to elucidating the properties of symptoms directly reported by patients (see for example ).
In response to the aforementioned methodological issues, item response theory (IRT)  offers promising solutions to address issues that have been difficult to solve through classical methods, and recently, IRT has been increasingly applied to PRO measures. In comparison with classical test theory, IRT offers a number of benefits. First, the application of IRT in examinations of item properties (items can be considered symptoms) adds knowledge regarding the level of severity and discriminating abilities of various symptoms. Such knowledge is of particular clinical interest for assessing symptomatology, as some items may hold higher discriminatory power for differentiating varied levels of clinical latent traits, while other items may reflect more severe symptoms. Second, comparisons from IRT-derived test information functions and their associated standard errors of measurement yield useful information about the contribution of different measures to measurement precision along the latent trait continuum. Clinicians can then determine the most useful and precise measures for assessing a specific level/range of the latent trait in either clinical or epidemiological populations. Third, IRT allows for a common metric on which the item parameters of multiple measures can be placed, and hence, score concordances can be constructed to link scores from one measure to corresponding scores on another measure, in order to facilitate direct comparability across measures. Clinicians can then further investigate whether the conventional cut-off scores on different measures lead to a convergent or divergent solution for clinical and epidemiological decision making.
Major depressive disorder (MDD) is among the significant causes of disease burden worldwide. Several well-developed and carefully validated PRO measures of depressive symptomatology have been widely used in research and clinical practice, including the Beck Depression Inventory–II (BDI-II), the Center for Epidemiologic Studies Depression Scale (CES-D), the Patient Health Questionnaire (PHQ-9), the depression subscale of the Depression, Anxiety and Stress Scale (DASS-Depression), and the depression subscale of the Hospital Anxiety and Depression Scale (HADS-Depression). These instruments have been validated in the Chinese context, with evidence of sound reliability and validity based primarily on classical test theory [10–18]. Under the IRT framework, studies conducted exclusively in Western cultures have offered good examples of comparing and linking multiple depression measures [19–25]. However, given the cultural variance in the assessment of depression [26–28], it remains unclear whether findings developed in Western populations are applicable to the Chinese context. Cultural differences in item endorsement on these commonly used depression inventories have been noted in past studies [26, 29]. Dere et al., for example, noted that Canadian university students of Chinese heritage tended to score higher on cognitive items of the BDI-II (e.g., past failure, worthlessness) than their European-heritage counterparts. However, those studies compared Caucasian-heritage and Asian-heritage students, and it remains unknown whether the findings generalize to native Chinese samples, particularly clinically depressed samples.
In addition, no studies thus far have attempted to apply IRT, a modern measurement technique, to multiple depression measures by examining item properties, measurement precision and score comparability together in the Chinese context.
Therefore, the present study attempts to fill this gap by applying IRT to measure depression through an examination of five depression measures (i.e., the BDI-II, CES-D, PHQ-9, DASS-Depression and HADS-Depression) in a clinical sample of depressed Chinese adults. Specifically, the following questions are addressed: (a) What levels of severity and discrimination are associated with the individual depressive symptoms assessed by the five measures? (b) To what extent can each of the five measures contribute to measurement precision in assessments of a full range of underlying depression levels? (c) What is the relationship between the scores from one measure and the corresponding scores from another measure? A clinical sample (N = 207) of Hong Kong Chinese outpatients seeking treatment for mood and anxiety disorders was recruited from local hospitals for this study.
Table 1. Sample characteristics (N = 207)
Age group (years): 19–29; 30–39; 40–49; 50–59; 60–69
Diagnosis: Major Depressive Disorder Only; Major Depressive Disorder with Comorbid Conditions (e.g., Anxiety Disorders); Dysthymia with Comorbid Conditions; Other Conditions (e.g., Bipolar Disorder, Mood Disorders due to General Medical Conditions)
Measures (Diagnostic interview and self-report questionnaires including the BDI-II, CES-D, PHQ-9, DASS, and HADS)
The Structured Clinical Interview for DSM-IV-TR Axis I Disorders (SCID) was administered to screen depressed patients. The 21-item BDI-II was designed to assess cognitive, behavioral and somatic symptoms of depression. The CES-D is a 20-item measure designed to assess depressive symptoms in epidemiological studies, focusing on the affective component of depression [6, 31, 32]. As a screening and diagnostic tool, the PHQ-9 is a nine-item instrument designed for use in primary care on the basis of the criteria for MDD in the Diagnostic and Statistical Manual of Mental Disorders-Fourth Edition (DSM–IV). The 21-item DASS was designed to measure three related negative emotional states: depression, anxiety and tension/stress. The HADS was developed to assess anxiety and depression in medical patients and excludes somatic symptoms (e.g., sleep disturbance) in order to avoid confounding psychological symptoms with disease or treatment. The Chinese versions of these measures have demonstrated sound reliability and validity for use with Chinese populations [10, 11, 13–16, 18].
Participants were tested individually upon providing written consent. They were invited to complete the SCID and a series of self-report depression and anxiety measurement instruments. Ethics approval was obtained from the Joint Institutional Review Board of the University of Hong Kong – Hospital Authority Hong Kong West Cluster and the Joint Chinese University of Hong Kong – New Territories East Cluster Clinical Research Ethics Committee.
Classical item analysis
Prior to fitting the IRT model, we performed classical item analysis to examine item quality and to inform the selection of the IRT model. At the item level, frequencies for each response category (ranging from 0 to 3), means, standard deviations and item-total correlations were computed. At the scale level, means and standard deviations of observed summed scores and Cronbach’s alpha values were calculated. A broad range of item-total correlations across items indicates that the selected IRT model should include a discrimination parameter.
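The scale- and item-level statistics described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy data are hypothetical, and the use of the corrected (item-removed) item-total correlation is an assumption, since the text does not state which variant or which software was used for this step.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows (one score per item)."""
    k = len(responses[0])
    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def corrected_item_total(responses):
    """Correlate each item with the summed score of the remaining items."""
    k = len(responses[0])
    return [
        pearson([row[j] for row in responses],
                [sum(row) - row[j] for row in responses])
        for j in range(k)
    ]


# Toy data (hypothetical): 5 respondents x 3 items, each scored 0-3.
toy = [[0, 1, 0], [1, 1, 1], [2, 2, 1], [3, 2, 3], [3, 3, 2]]
alpha = cronbach_alpha(toy)
item_total = corrected_item_total(toy)
print(round(alpha, 3), [round(r, 2) for r in item_total])
```

If the item-total correlations returned by such a routine differ widely across items, a model with item-specific discrimination parameters is the natural choice.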
IRT assumption checking
We tested two IRT assumptions: unidimensionality and local independence. To assess essential unidimensionality, we first examined the ratio of the first to the second eigenvalue; a ratio of 4 or more is generally accepted as support for unidimensionality. Further, a single-factor confirmatory factor analysis (CFA) model was fitted to polychoric correlations with weighted least squares estimation using Mplus 6. A single-factor CFA model was run on each measure independently to provide evidence of validity based on the internal structure. Because the IRT concurrent calibration was planned for the combined item set comprising all five measures, we also performed a CFA on the combined dataset. A good fit of the single-factor solution supports the unidimensionality assumption. Adequate fit is generally indicated by a comparative fit index (CFI) value above .90, a Tucker Lewis index (TLI) value above .90, and a root mean square error of approximation (RMSEA) value below .10, while very good fit is typically indicated by a CFI value above .95, a TLI value above .95 and an RMSEA value below .05 [36–39].
Next, we assessed local dependence between item pairs by using Chen and Thissen’s chi-square local dependence statistic (LD χ²) in IRTPRO. An LD χ² value of 10 or greater [40, 42] indicates local dependence.
IRT concurrent calibration and goodness-of-fit assessment
The combined item set comprising the five measures was concurrently calibrated in a single IRT run by using the graded response model (GRM) in MULTILOG 7.03, so that the item parameter estimates were placed onto a single common metric. Further, we checked the standard errors (SEs) of the item parameter estimates to ensure that the GRM was well estimated. Average SE values for item parameters between .20 and .35 indicate good estimates. Additionally, we evaluated the degree of fit between the IRT model and the data by using Orlando and Thissen’s summed-score item-fit statistic (S-X²). A nonsignificant result indicates adequate model fit.
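For readers less familiar with the GRM, the sketch below shows how category probabilities follow from one discrimination parameter (a) and three ordered thresholds (b1 to b3) for an item scored 0 to 3. It assumes the logistic form without a scaling constant; the function names and the threshold values are hypothetical, and the code is not a reproduction of the MULTILOG calibration.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    P(X >= k | theta) = logistic(a * (theta - b_k)) for k = 1..m; the
    probability of each category is the difference between adjacent
    cumulative curves.
    """
    cumulative = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item scored 0-3: a = 1.73, thresholds b = (-1, 0, 1).
probs = grm_category_probs(theta=0.0, a=1.73, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

A larger a makes the cumulative curves steeper (stronger discrimination), while larger b values shift them toward higher θ (greater severity).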
Outcome score linking and score concordances construction
Linking secures the comparability of scores across different measures and typically consists of three steps: (a) selecting a data collection design, (b) placing parameter estimates on a common metric, and (c) linking test scores. A single-group design in which each respondent was administered all five depression instruments was adopted. Concurrent calibration was performed to place parameter estimates on a common metric. IRT true score equating  was implemented in POLYEQUATE  to perform the outcome score linking and construct score concordances in order to transfer every possible summed score to a corresponding IRT-derived θ score and associate the summed scores across the five measures. Before performing the linking, we tested the linking assumptions of construct similarity, equity and subgroup invariance .
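The logic of IRT true score equating can be illustrated with a short sketch: once all items share a common metric, a summed score on one measure is mapped to the θ at which that measure's expected summed score (its test characteristic curve) equals the score, and the expected summed score of a second measure is then read off at the same θ. The item parameters and function names below are hypothetical, and the bisection search is an assumption made for illustration; this is not the POLYEQUATE implementation.

```python
import math


def expected_item_score(theta, a, thresholds):
    """Expected score of a GRM item: the sum over k of P(X >= k | theta)."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)


def true_score(theta, items):
    """Test characteristic curve: expected summed score at theta."""
    return sum(expected_item_score(theta, a, bs) for a, bs in items)


def theta_for_score(score, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Invert the test characteristic curve by bisection.

    The curve increases monotonically in theta when every a > 0.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def link_score(score, items_x, items_y):
    """Map a summed score on measure X to theta and to a score on measure Y."""
    theta = theta_for_score(score, items_x)
    return theta, true_score(theta, items_y)


# Hypothetical parameters (a, [b1, b2, b3]) already on a common metric.
measure_x = [(1.5, [-1.0, 0.0, 1.0]), (2.0, [-0.5, 0.5, 1.5])]
measure_y = [(1.0, [-1.0, 0.0, 1.0])]
theta, linked = link_score(3.0, measure_x, measure_y)
print(round(theta, 3), round(linked, 3))
```

Summed scores of zero and the maximum have no finite θ under this inversion, so they require separate handling in practice.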
Classical item analysis
Table 2. Results from classical item analysis and unidimensionality analysis of the BDI-II, CES-D, PHQ-9, DASS-Depression and HADS-Depression
Column groups: Exploratory Factor Analysis (EFA); Confirmatory Factor Analysis (CFA); Correlations (in lower triangle)/Disattenuated Correlations (in upper triangle)
Ranges reported: .43–.73; .27–.79; .61–.79; .58–.80; .44–.65; .21–.81
IRT assumption checking
For each depression measure and the combined item set, the ratio of the first to the second eigenvalue considerably exceeded 4. From the CFA, the fit statistics suggested either adequate or very good fit, depending on the fit statistic referenced (Table 2). Notably, for the combined item set for which the IRT calibration was planned, the fit statistics showed very good fit (CFI = 0.95, RMSEA = 0.051, TLI = 0.95). All these results lend support to the essential unidimensionality assumption.
Local independence was largely supported, with the exception of one item pair: the BDI-II item “Crying” and the CES-D item “I had crying spells” exhibited an LD χ² value slightly higher than 10 (χ² = 10.4), likely because the items were similar in content.
Considering that the data were essentially unidimensional and that almost all item pairs were locally independent, we considered that the data were suitable for IRT calibration and thus proceeded with the IRT analysis.
Evaluation of linking assumptions
The linking assumptions of construct similarity, equity and subgroup invariance were tested for the appropriateness of linking. To ensure that the five scales essentially measure the same or similar underlying constructs, we considered the single factor solution from the CFA and the high level of internal consistency from Cronbach’s alpha on the combined item set (α = .98) to be supporting evidence of construct similarity. To ensure that the scores of the five measures to be linked were highly correlated for the equity assumption, we computed correlations (ranging from .73 to .85) and disattenuated correlations (ranging from .85 to .93) in the pairwise observed scale scores (Table 2), indicating that the five measures were strongly correlated. In terms of the subgroup invariance assumption, the same item function relating IRT-derived θ scores and summed scores generally held across gender groups, providing support for the subgroup invariance assumption.
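The disattenuated correlations mentioned above correct an observed correlation for the unreliability of both scales. A minimal sketch follows, assuming the standard Spearman correction with a reliability coefficient such as Cronbach's alpha for each scale; the function name and the input values are hypothetical and are not taken from Table 2.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical example: observed r = .72, reliabilities .90 and .80.
corrected = disattenuate(0.72, 0.90, 0.80)
print(round(corrected, 3))
```

Because each reliability is at most 1, the corrected value is never smaller in magnitude than the observed correlation, which is consistent with the disattenuated range reported here lying above the observed range.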
IRT concurrent calibration and goodness-of-fit assessment
Evaluation of estimation accuracy and model-data fit
Table 3. Item content, response frequencies, IRT item parameter estimates and fit statistics
Columns: item content (symptom code); response frequencies (%); item parameter estimates, including b1 (SE), b2 (SE) and b3 (SE); fit statistics
BDI-II: Past failure (FH); Loss of pleasure (LI); Guilty feelings (FH); Punishment feelings (FH); Suicidal thoughts (SU); Loss of interest (LI); Loss of energy (LE); Changes in sleep (CS); Changes in appetite (WC); Concentration difficulty (CD); Tiredness or fatigue (LE); Loss of interest in sex (LI)
CES-D: Bothered by things (CD); My appetite was poor (WC); Couldn’t shake off blues (DM); I am just as good as other people (FH); I had trouble concentrating (CD); I felt depressed (DM); Everything I did was an effort (LE); I felt good about the future (FH); I thought I was a failure (FH); I felt fearful (DM); My sleep was restless (CS); I was happy (DM); I talked less than usual (PA); I felt lonely (DM); People were unfriendly (FH); I enjoyed life (LI); I had crying spells (DM); I felt sad (DM); I felt that people disliked me (FH); I could not “get going.” (DM)
PHQ-9: Little interest (LI); Feeling down, depressed, or hopeless (DM); Sleep disturbance (CS); Feeling tired or having little energy (LE); Poor appetite or overeating (WC); Feeling bad about oneself (FH); Trouble concentrating (CD); Moving or speaking slowly (PA); Thoughts of death (SU)
DASS-Depression: No positive feeling at all (DM); No initiatives (LI); Had nothing to look forward to (FH); Felt down-hearted and blue (DM); Unable to become enthusiastic (LI); Wasn’t worth much as a person (FH); Life was meaningless (FH)
HADS-Depression: Enjoy the things I used to enjoy (LI); Laugh and see the funny side of things (LI); Feel cheerful (DM); Feel as if I am slowed down (PA); Lost interest in my appearance (LI); Look forward with enjoyment to things (LI); Enjoy good book or radio/TV program (LI)
Nine items were reported to show a lack of fit, while good fit was indicated for the rest of the items (Table 3). We further examined the consequence of item misfit on the item and person parameter estimates and found that either including or excluding the nine items yielded nearly identical results. Therefore, as we considered the consequence minor and the misfit tolerable , we included all items in the outcome score linking.
Comparison of item properties across the five depression measures
The item discrimination (a) parameters (Table 3) across the five measures ranged in value from 0.36 to 3.43 (M = 1.73, SD = 0.62). Notably, items addressing depressed mood, suicidality and feelings of worthlessness were the strongest discriminating indicators; thus, they were the most useful for discriminating among respondents with varied levels of depression. The second most discriminating set of indicators included items on fatigue or loss of energy, psychomotor agitation or retardation, and concentration difficulties. The moderately discriminating set of indicators contained items on changes in sleep and changes in appetite. CES-D items on positive affect (e.g., “I am just as good as other people”, “I felt good about the future” and “I was happy”) had the weakest ability to distinguish respondents with varied depression levels and thus added the least information to the depression measurement. Of additional interest was the great variation in the discriminating abilities of items on loss of interest (a parameter estimates ranging from 0.87 to 2.94).
Regarding item severity (b) parameters (Table 3), symptoms pertaining to suicidality, changes in appetite, depressed mood, feelings of worthlessness and psychomotor agitation or retardation were associated with high levels of severity. Items on concentration difficulties, fatigue or loss of energy and loss of interest, followed by problems related to changes in sleep, were associated with moderate levels of severity. All of the four CES-D items on positive affect were associated with the lowest levels of severity.
Outcome score linking and score concordances construction
Comparison of cut-off theta scores across the five depression measures
Table 4. Score concordances at cut-off scores of the BDI-II, CES-D, PHQ-9, DASS-Depression and HADS-Depression
Columns: IRT score (θ); corresponding summed score in the BDI-II, CES-D, PHQ-9, DASS-Depression and HADS-Depression
Severity categories include: risk for depression; moderately severe depression; extreme severe depression
Comparison of cut-off summed scores across the five depression measures
In the same score concordances, each cut-off (observed) summed score for each measure was associated with a (observed) summed score for each of the other four measures (Table 4). Notably, the resulting cut-off scores across the five measures led to either a consistent or discrepant diagnosis for depression. For instance, the cut-off scores for mild depression on the BDI-II and the PHQ-9 were equivalent to each other, whereas the cut-off score for moderate depression on the BDI-II corresponded to the cut-off score for mild depression on the HADS-Depression.
Comparison of measurement precision across the five depression measures
This is the first study, in the Chinese context, to utilize an IRT approach to the measurement of depression through an examination of five depression measures simultaneously, namely, the BDI-II, CES-D, PHQ-9, DASS-Depression and HADS-Depression.
Psychometric properties and clinical utility of the five depression measures
The work presented herein contributes to knowledge on depression measurement in the Chinese context. First, the findings demonstrated that the five depression measures had sound reliability and validity for depressed Chinese adults. Our findings join previous studies [10, 11, 13–16, 18, 52–54] in providing supporting evidence of the psychometric properties of these instruments in the same context. Notably, the reverse-scored CES-D items measuring positive affect (e.g., “I am just as good as other people”, “I felt good about the future”) were found to be the least discriminating and to reflect the least severe symptoms; thus, they added little to the measurement precision of depression assessments in the studied context. Our findings echo the work of Iwata et al., who suggested that the positively worded CES-D positive affect items cannot adequately assess depressive disorders in the Japanese population.
This observation across cultures invites a broader rethinking of the role of these instruments in guiding treatment decisions. In determining treatment outcomes, remission is traditionally defined by substantial (or complete) alleviation of depressive symptoms. In the absence of apparent biological state markers for major depression, recovery progress can only be defined phenomenologically, often by comparing patients’ symptom severity with a predetermined diagnostic threshold or clinical cut-off score on these well-validated depression inventories. These conceptions, however, have been challenged by recent research advocating a broadening of the concept of remission beyond symptom resolution (e.g., [57, 58]). This proposal holds that multiple domains, including, for example, subjectively perceived functional improvement and quality of life, should also be taken into account if a holistic, patient-centered metric of recovery is to be considered. In light of this, more comprehensive depression instruments, such as the Remission from Depression Questionnaire (RDQ), have been developed. From a culturally sensitive perspective, the present findings point to the importance of incorporating these person-centered instruments alongside standardized depression symptom severity scales, especially when the information is to be used in guiding treatment decisions in practice. This is because the benchmark of specific item endorsement on a symptom severity scale may be culturally dependent, and patients’ perspectives on remission status may provide collateral information that helps with efficacious treatment planning tailored to the individual’s needs.
Second, our findings help elucidate cultural variations in depressive symptomatology. Symptoms pertaining to psychologization, such as depressed mood, suicidality and feelings of worthlessness, served as the strongest discriminating indicators, while symptoms pertaining to somatization, such as psychomotor agitation, fatigue or loss of energy, concentration difficulties, changes in sleep and changes in appetite, were found to exhibit highly or moderately discriminating abilities. In terms of severity, symptoms related to suicidality, changes in appetite, depressed mood, and feelings of worthlessness appeared to reflect a high level of severity in the Chinese clinical sample. The findings of the present study share some consistencies with those from previous studies. For instance, suicidality and changes in appetite also emerged at a high level of severity in Western contexts [2, 60]. However, discrepancies do exist. The symptom of feelings of worthlessness was ranked at a relatively low level of severity in the Western context , while the same symptom appeared to be rated at a relatively high level of severity by the Chinese outpatients in our study. Similarly, the high level of severity and high discriminating ability of feelings of worthlessness in our findings are in accordance with Saito et al.’s work conducted on a Japanese community sample . Intriguingly, recent work also showed that the cognitive component of negative self-evaluation is an important factor that differentiates reports of depressive symptomatology between Asian and Western youths . The salience of a heightened sense of self-worth may be related to a deep-rooted Confucian value among Asian Chinese, where a person’s intrinsic value is highly dependent on how well the person meets social expectations in serving the collective interest of the social group. 
Furthermore, a loss of functioning resulting from depression, especially in its severe form, may bring about intense shameful feelings and self-doubt, which further exacerbates a negative vicious cycle of affective-cognitive disruptions.
Intriguingly, a closely related observation is that several items that demonstrated misfit seemed to be associated with a systematic symptom theme. For example, the items “Guilty feelings” and “Punishment feelings” in the BDI-II, the items “I am just as good as other people” and “I felt good about the future” in the CES-D, and the item “Feeling bad about oneself” in the PHQ-9 all loaded onto the same “Feelings of hopelessness” (FH) theme. These items reflect a strong sense of responsibility and echo the cultural belief that a person’s value should be closely linked with the social roles that one is expected to perform in collectivistic societies. It would be interesting to test in future studies whether the same pattern of misfitting items is observed in individualistic cultures.
Third, the findings on the item and test information offer valuable information regarding how each item/symptom and each measure reliably/precisely assess depression at varied levels. Though they may share similarities in discrimination parameters, items may vary in precision/usefulness for assessing varied levels of depression. For instance, between two DASS-Depression items with similar a values, “No positive feeling at all” was more useful for assessing mild and moderate depression, whereas “Felt down-hearted and blue” was more informative for assessing moderate to extreme severe depression. The finding in this example helps us better understand the gradient of affective dysregulation experienced by sufferers of depression and suggests that a loss of positive affect may precede, or interactively exacerbate, the experience of intensive negative affect in the course of depression.
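The point that two items with similar a values can be informative at different depression levels can be made concrete with the GRM item information function. The sketch below uses the standard GRM information formula with hypothetical parameters; neither the function names nor the values correspond to the DASS-Depression estimates in Table 3.

```python
import math


def grm_item_information(theta, a, thresholds):
    """Item information under the graded response model.

    I(theta) is the sum over categories k of (P*_k' - P*_{k+1}')^2 / P_k,
    where P*_k is the cumulative curve and P*_k' = a * P*_k * (1 - P*_k).
    """
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    slopes = [a * p * (1.0 - p) for p in cum]  # zero at both boundary curves
    info = 0.0
    for k in range(len(thresholds) + 1):
        p_k = cum[k] - cum[k + 1]
        info += (slopes[k] - slopes[k + 1]) ** 2 / p_k
    return info


def standard_error(theta, items):
    """Standard error of measurement: 1 / sqrt(test information)."""
    total = sum(grm_item_information(theta, a, bs) for a, bs in items)
    return 1.0 / math.sqrt(total)


# Two hypothetical items with the same discrimination but different thresholds.
milder = (2.0, [-1.0, 0.0, 1.0])
severer = (2.0, [0.0, 1.0, 2.0])
for theta in (-0.5, 1.5):
    print(theta,
          round(grm_item_information(theta, *milder), 3),
          round(grm_item_information(theta, *severer), 3))
```

With equal a, the item whose thresholds sit lower on θ carries more information at milder levels, and the item with higher thresholds carries more at severe levels. Summing item information within a scale gives the test information that underlies comparisons of measurement precision between measures.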
At the scale level, the findings showed that the five depression measures contributed in various degrees to measurement precision along the full range of the underlying depression levels, providing insight into instrument selection. Specifically, in the studied context, the BDI-II and the CES-D were informative on a wider range of depression levels and had greater measurement precision than the other three measures. The PHQ-9 and the DASS-Depression were particularly useful for assessing depression in clinical populations, as the former was informative for measuring depression ranging from mild to severe and the latter was informative for assessing depression ranging from moderate to extreme severe. Accordingly, clinicians can choose the measure that is the most useful/precise for assessing a specific level of depressive severity at the patient level in either clinical or epidemiological populations. Notably, the HADS-Depression appeared to be the least informative for assessing depression in the Chinese context, as the majority of its items had moderate or low discrimination parameter estimates.
Our pattern of score concordance results echoes previous studies in suggesting that commonly used depression scales differ in their classification of depression severity. Zimmerman and colleagues [63, 64], for example, administered the Hamilton Depression Rating Scale (HDRS), the PHQ-9, the Clinically Useful Depression Outcome Scale (CUDOS) and the Quick Inventory of Depressive Symptomatology (QIDS) to a group of clinically depressed patients and compared the diagnostic outcomes indicated by the reported scores in each case. The authors noted significant variance in the distribution of patients classified into discrete severity categories when different scales were used. This level of disagreement implies that treatment planning based solely on data collected from a single self-report scale may be over-inclusive, even though these scales are all well validated and standardized.
Finally, the clinical values of the score concordances reported herein are worth highlighting following from the previous point. With scores obtained from the administration of one depression measure, one can use the concordance table to locate the corresponding scores on other depression measures without administering them. Clinicians can then determine depression diagnoses for individual respondents on the basis of the cut-off scores for these rating scales and other interview-based assessments. Further, scores across the five measures are not only aligned with each other in the observed score metric but also mapped to the IRT scores at the θ metric. Such mapping offers clinical meanings for the arbitrary θ metric. For instance, respondents who scored 0.47 or above (at θ metric) on the BDI-II are likely to be diagnosed as severely depressed. Clinicians can then refer to the item information function curves to locate the symptoms that are more informative for assessing this restricted range of severe depression.
Advantages of the methodology
The methodology used in this study has several notable strengths. First, we followed a single-group design for the outcome score linking. Such a design directly controls for differences in response propensities because the instruments are administered to the same respondents. Additionally, we used concurrent calibration, which is less time-consuming and produces more stable results than separate calibration. Second, we tested the linking assumptions. Such a practice deserves more attention, and it is strongly encouraged in studies on linking PRO measures to ensure the validity of the inferences drawn from the score concordances. Finally, instead of relying solely on chi-square-like IRT fit statistics, which can be sensitive to sample size, we evaluated IRT item misfit by focusing on the consequences of using misfitting items and the item statistics associated with them, a strategy strongly recommended by Hambleton and Han and by Zhao. We hope that future studies will adopt a similarly rigorous approach to methodological issues in order to promote the quality of PRO research and to ensure the appropriate application of IRT models.
Limitations and future directions
The major limitations and future directions of the present study are discussed below. First, a convenience sampling approach was used to recruit participants because of practical restrictions, which limits the representativeness of the sample and the generalizability of the results. A related issue is the unbalanced gender ratio, which limits the power of statistical tests to detect gender differences. Second, the outcome score linking function/relationship established in the study may be sensitive to population differences, and only one linking approach was used in this study. It would seem prudent to evaluate the robustness of the linking relationship across different samples (e.g., in Chinese nonclinical samples) and across multiple linking approaches (e.g., both IRT-based and non-IRT-based approaches). Additionally, whether the invariance of item parameters holds across clinical and nonclinical populations also requires further investigation. With additional sets of larger clinical and epidemiological samples, a more robust item bank and score concordances can be established. Third, the present study did not incorporate other patient-centered instruments that assess perceived remission status for comparison purposes. As mentioned previously, these patient-centered instruments are informative in defining depression remission with reference to symptom severity, and it would be useful to take them into account and to explore their potential merits. Future studies could consider including, for example, the Remission from Depression Questionnaire [58, 66] and/or the Remission Evaluation and Mood Inventory Tool [67, 68]. Furthermore, it would be useful to conduct follow-up studies with large samples to cross-reference the depression scales with interview-based clinical diagnostic tools relating to depressive symptomatology.
Finally, the five depression measures covered in the study have all been developed in Western cultures, although the Chinese versions of these measures have been demonstrated to have sound psychometric properties. Nonetheless, the cut-off scores for depression diagnosis that have been suggested based on the Western context deserve further validation in the Eastern context.
Conclusions
Based on an examination of five depression measures, the findings of the present study demonstrated (a) levels of severity and discrimination for individual depressive symptoms, (b) measurement precision for each measure at varied levels of depression, and (c) the comparability of severity cut-off scores across the five measures. The study provides additional evidence regarding the psychometric properties and clinical utility of the PRO measures, offers methodological contributions to the appropriate use of IRT models in PRO measures, and, more importantly, enhances our understanding of cultural variation and depressive symptomatology.
Abbreviations
BDI: Beck depression inventory
CES-D: Center for epidemiologic studies depression scale
CFA: Confirmatory factor analysis
CFI: Comparative fit index
DASS: Depression, anxiety and stress scale
GRM: Graded response model
HADS: Hospital anxiety and depression scale
IRT: Item response theory
MDD: Major depressive disorder
PHQ: Patient Health Questionnaire
RMSEA: Root mean square error of approximation
SCID: Structured clinical interview
TLI: Tucker Lewis index
Acknowledgements
We wish to thank Dr. Jason Chan for his kind assistance in patient recruitment.
Funding
This research was funded through University of Hong Kong seed funding for basic research (project no: 201211159116).
Availability of data and materials
The datasets analysed in the current study are available from the corresponding author on reasonable request.
Authors' contributions
YZ conducted the study design, performed the data analyses, and was a major contributor in writing and revising the manuscript. WC contributed to the study design, data analyses and manuscript revision. BCYL contributed to study design, data collection, clinical consultation, manuscript writing and revision, funding support and overall project management. All authors read and approved the final manuscript.
Authors' information
Dr. Yue Zhao is currently Director of the Teaching and Learning Evaluation and Measurement Unit at The University of Hong Kong. She earned her doctorate in Psychometrics from the University of Massachusetts, and her research interests lie broadly in the advancement and application of quantitative methods in the social and health sciences. Her work has been published in the Encyclopedia of Statistics in Behavioral Science and in refereed journals such as Assessment and Quality of Life Research.
Professor Wai Chan is Associate Professor in the Department of Psychology at The Chinese University of Hong Kong. He obtained his Ph.D. in Quantitative Psychology from the University of California, Los Angeles, and has published widely in peer-reviewed journals including Psychological Methods, Structural Equation Modeling, Psychometrika and Behavior Research Methods.
Dr. Barbara Chuen Yee Lo is Assistant Professor in the Department of Psychology, The University of Hong Kong. She received her Master of Social Science (Clinical Psychology) degree from the Chinese University of Hong Kong and obtained her PhD degree from the University of Melbourne in Australia. She is a Registered Psychologist in Hong Kong and Australia. Her research focuses on affective dysregulation, and she has published widely in international journals such as Health Psychology Review, Psychiatry Research and Assessment.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
The study was conducted in accordance with the Declaration of Helsinki. Ethics approval was obtained from the Joint Institutional Review Board of the University of Hong Kong – Hospital Authority Hong Kong West Cluster and the Joint Chinese University of Hong Kong – New Territories East Cluster Clinical Research Ethics Committee. Written consent was obtained from participants who took part in the study.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Cella D, Hahn EA, Jensen SE, Butt Z, Nowinski CJ, Rothrock N. Methodological issues in the selection, administration, and use of patient-reported outcomes in performance measurement in health care setting. National Quality Forum. 2012. https://www.qualityforum.org/Projects/n-r/Patient-Reported_Outcomes/Patient-Reported_Outcomes.aspx. Accessed 08 Dec 2015.
- Cole DA, Cai L, Martin NC, Findling RL, Youngstrom EA, Garber J, et al. Structure and measurement of depression in youths: applying item response theory to clinical data. Psychol Assess. 2011. doi:10.1037/a0023518.
- Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response theory. Newbury Park: Sage; 1991.
- Moussavi S, Chatterji S, Verdes E, Tandon A, Patel V, Ustun B. Depression, chronic diseases, and decrements in health: results from the World Health Surveys. Lancet. 2007. doi:10.1016/s0140-6736(07)61415-9.
- Beck AT, Steer RA, Ball R, Ranieri WF. Comparison of Beck Depression Inventories-IA and-II in psychiatric outpatients. J Pers Assess. 1996. doi:10.1207/s15327752jpa6703_13.
- Radloff LS. The CES-D scale: a self-report depression scale for research in the general population. Appl Psychol Meas. 1977. doi:10.1177/014662167700100306.
- Kroenke K, Spitzer RL, Williams JB. The PHQ‐9. J Gen Intern Med. 2001. doi:10.1046/j.1525-1497.2001.016009606.x.
- Lovibond PF, Lovibond SH. The structure of negative emotional states: comparison of the Depression Anxiety Stress Scales (DASS) with the Beck depression and anxiety inventories. Behav Res and Ther. 1995. doi:10.1016/0005-7967(94)00075-u.
- Zigmond AS, Snaith RP. The hospital anxiety and depression scale. Acta Psychiat Scand. 1983. doi:10.1111/j.1600-0447.1983.tb09716.x.
- Chan DW. Coping with depressed mood among Chinese medical students in Hong Kong. J Affect Disord. 1992. doi:10.1016/0165-0327(92)90025-2.
- Chan RC, Xu T, Huang J, Wang Y, Zhao Q, Shum DH, et al. Extending the utility of the Depression Anxiety Stress scale by examining its psychometric properties in Chinese settings. Psychiatry Res. 2012. doi:10.1016/j.psychres.2012.06.041.
- Lai BP, Tang AK, Lee DT, Yip AS, Chung TK. Detecting postnatal depression in Chinese men: a comparison of three instruments. Psychiatry Res. 2010. doi:10.1016/j.psychres.2009.07.015.
- Leung CM, Ho S, Kan CS, Hung CH, Chen CN. Evaluation of the Chinese version of the hospital anxiety and depression scale: a cross-cultural perspective. Int J Psychosom. 1993;40:29–34.
- Shek DT. Reliability and factorial structure of the Chinese version of the Beck Depression Inventory. J Clin Psychol. 1990. doi:10.1002/1097-4679(199001)46:1<35::aid-jclp2270460106>3.0.co;2-w.
- Shek DT. What does the Chinese version of the Beck Depression Inventory measure in Chinese students–general psychopathology or depression? J Clin Psychol. 1991. doi:10.1002/1097-4679(199105)47:3<381::aid-jclp2270470309>3.0.co;2-d.
- Taouk M, Lovibond P, Laube R. Psychometric Properties of a Chinese Version of the 21-item Depression Anxiety Stress Scales (DASS21). 2001. http://www2.psy.unsw.edu.au/dass/Chinese/Chinese%20DASS21%20Paper.pdf. Accessed 14 Nov 2016.
- Wong WS, Chen PP, Yap J, Mak KH, Tam BKH, Fielding R. Assessing depression in patients with chronic pain: a comparison of three rating scales. J Affect Disord. 2011. doi:10.1016/j.jad.2011.04.012.
- Ying YW. Depressive symptomatology among Chinese‐Americans as measured by the CES‐D. J Clin Psychol. 1988. doi:10.1002/1097-4679(198809)44:5<739::aid-jclp2270440512>3.0.co;2-0.
- Choi SW, Schalet B, Cook KF, Cella D. Establishing a common metric for depressive symptoms: Linking the BDI-II, CES-D, and PHQ-9 to PROMIS Depression. Psychol Assess. 2014. doi:10.1037/a0035768.
- Fischer HF, Tritt K, Klapp BF, Fliege H. How to compare scores from different depression scales: Equating the Patient Health Questionnaire (PHQ) and the ICD‐10‐Symptom Rating (ISR) using Item Response Theory. Int J Meth Psychiatr Res. 2011. doi:10.1002/mpr.350.
- Gibbons LE, Feldman BJ, Crane HM, Mugavero M, Willig JH, Patrick D, et al. Migrating from a legacy fixed-format measure to CAT administration: Calibrating the PHQ-9 to the PROMIS depression measures. Qual Life Res. 2011. doi:10.1007/s11136-011-9882-y.
- Gibbons LE, Feldman BJ, Crane HM, Mugavero M, Willig JH, Patrick D, et al. Erratum to: migrating from a legacy fixed-format measure to CAT administration: calibrating the PHQ-9 to the PROMIS depression measures. Qual Life Res. 2013. doi:10.1007/s11136-012-0313-5.
- Olino TM, Yu L, Klein DN, Rohde P, Seeley JR, Pilkonis PA, et al. Measuring depression using item response theory: an examination of three measures of depressive symptomatology. Int J Meth Psychiatr Res. 2012. doi:10.1002/mpr.1348.
- Olino TM, Yu L, McMakin DL, Forbes EE, Seeley JR, Lewinsohn PM, et al. Comparisons across depression assessment instruments in adolescence and young adulthood: An item response theory study using two linking methods. J Abnorm Child Psychol. 2013. doi:10.1007/s10802-013-9756-6.
- Wahl I, Löwe B, Bjorner JB, Fischer F, Langs G, Voderholzer U, et al. Standardization of depression measurement: a common metric was developed for 11 self-report depression measures. J Clin Epidemiol. 2014. doi:10.1016/j.jclinepi.2013.04.019.
- Dere J, Watters CA, Yu SCM, Bagby RM, Ryder AG, Harkness KL. Cross-cultural examination of measurement invariance of the Beck Depression Inventory–II. Psychol Assess. 2015. doi:10.1037/pas0000026.
- Kalibatseva Z, Leong FT. Depression among Asian Americans: review and recommendations. Depress Res Treat. 2011. doi:10.1155/2011/320902.
- Leong FT, Okazaki S, Tak J. Assessment of depression and anxiety in East Asia. Psychol Assess. 2003. doi:10.1037/1040-3590.15.3.290.
- Parker G, Chan B, Tully L. Recognition of depressive symptoms by Chinese subjects: the influence of acculturation and depressive experience. J Affect Disord. 2006. doi:10.1016/j.jad.2006.03.002.
- First MB, Spitzer RL, Gibbon M, Williams JBW. Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Non-patient Edition (SCID-I/NP). New York: Biometrics Research, New York State Psychiatric Institute; 2002.
- Naughton MJ, Wiklund I. A critical review of dimension-specific measures of health-related quality of life in cross-cultural research. Qual Life Res. 1993. doi:10.1007/bf00422216.
- Radloff LS, Locke BZ. The community mental health assessment survey and the CES-D scale. In: Weissman MM, Myers JK, Ross CE, editors. Community surveys of psychiatric disorders. New Brunswick: Rutgers University Press; 1986. p. 177–89.
- American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 4th ed. Washington, DC: American Psychiatric Press; 1994. http://dx.doi.org/10.1017/s0033291700035765.
- Reise SP, Waller NG. Fitting the two-parameter model to personality data. Appl Psychol Meas. 1990. doi:10.1177/014662169001400105.
- Muthén LK, Muthén BO. Mplus. Version 4 [computer software]. Los Angeles: Muthén & Muthén; 2006.
- Bentler PM, Bonett DG. Significance tests and goodness of fit in the analysis of covariance structures. Psychol Bull. 1980. doi:10.1037/0033-2909.88.3.588.
- Browne MW, Cudeck R. Alternative ways of assessing model fit. Sociol Methods Res. 1993. doi:10.1177/0049124192021002005.
- Hu LT, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Modeling. 1999. doi:10.1080/10705519909540118.
- Lance CE, Butts MM, Michels LC. The sources of four commonly reported cutoff criteria what did they really say? Organ Res Meth. 2006. doi:10.1177/1094428105284919.
- Chen WH, Thissen D. Local dependence indexes for item pairs using item response theory. J Educ Behav Stat. 1997. doi:10.3102/10769986022003265.
- Cai L, Thissen D, du Toit S. IRTPRO. Version 2.01 [computer software]. Lincolnwood: Scientific Software International; 2011.
- Liu Y, Thissen D. Identifying local dependence with a score test statistic based on the bifactor logistic model. Appl Psychol Meas. 2012. doi:10.1177/0146621612458174.
- Samejima F. Estimation of latent ability using a response pattern of graded scores. Chicago: Psychometric Society; 1969. http://dx.doi.org/10.1002/j.2333-8504.1968.tb00153.x.
- Thissen D, Chen W-H, Bock RD. MULTILOG. Version 7.03 [computer software]. Lincolnwood: Scientific Software International; 2003.
- Tay L, Meade AW, Cao M. An overview and practical guide to IRT measurement equivalence analysis. Organ Res Meth. 2015. doi:10.1177/1094428114553062.
- Orlando M, Thissen D. Likelihood-based item-fit indices for dichotomous item response theory models. Appl Psychol Meas. 2000. doi:10.1177/01466216000241003.
- Kolen MJ, Brennan RL. Test equating, scaling, and linking. New York: Springer; 2004. p. 201–5. http://dx.doi.org/10.1007/978-1-4757-4310-4.
- Kolen MJ. POLYEQUATE: a computer program for IRT true and observed scoring equating for dichotomously and polytomously scored tests [computer software]. Iowa: Iowa Testing Programs, University of Iowa; 2004.
- Dorans NJ. Linking scores from multiple health outcome instruments. Qual Life Res. 2007. doi:10.1007/s11136-006-9155-3.
- Zhao Y. Impact of IRT item misfit on score estimates and severity classifications: an examination of PROMIS depression and pain interference item banks. Qual Life Res. 2017. doi:10.1007/s11136-016-1467-3.
- Cappelleri JC, Lundy JJ, Hays RD. Overview of classical test theory and item response theory for the quantitative assessment of items in developing patient-reported outcomes measures. Clin Ther. 2014. doi:10.1016/j.clinthera.2014.04.006.
- Tang WK, Wong E, Chiu HFK, Lum CM, Ungvari GS. Examining item bias in the anxiety subscale of the Hospital Anxiety and Depression Scale in patients with chronic obstructive pulmonary disease. Int J Meth Psychiatr Res. 2008. doi:10.1002/mpr.234.
- Tang WK, Wong E, Chiu HFK, Ungvari GS. Rasch analysis of the scoring scheme of the HADS Depression subscale in Chinese stroke patients. Psychiatry Res. 2007. doi:10.1016/j.psychres.2006.01.015.
- Wu PC, Chang L. Psychometric properties of the Chinese version of the Beck Depression Inventory-II using the Rasch model. Meas Eval Couns Dev. 2008;41:13.
- Iwata N, Umesue M, Egashira K, Hiro H, Mizoue T, Mishima N, et al. Can positive affect items be used to assess depressive disorders in the Japanese population? Psychol Med. 1998. doi:10.1017/s0033291797005898.
- Zimmerman M, Mcglinchey JB, Posternak MA, Friedman M, Attiullah N, Boerescu D. How should remission from depression be defined? The depressed patient’s perspective. Am J Psychiatry. 2006. doi:10.1176/appi.ajp.163.1.148.
- Zimmerman M, Mcglinchey JB, Posternak MA, Friedman M, Boerescu D, Attiullah N. Remission in depressed outpatients: more than just symptom resolution? J Psychiatr Res. 2008. doi:10.1016/j.jpsychires.2007.09.004.
- Trujols J, Portella MJ, Pérez V. Toward a genuinely patient-centered metric of depression recovery: one step further. JAMA Psychiatry. 2013. doi:10.1001/jamapsychiatry.2013.2187.
- Zimmerman M, Martinez JH, Attiullah N, Friedman M, Toba C, Boerescu DA, et al. A new type of scale for determining remission from depression: the remission from depression questionnaire. J Psychiatr Res. 2013. doi:10.1016/j.jpsychires.2012.09.006.
- Simon GE, Goldberg DP, Von Korff M, Üstün TB. Understanding cross-national differences in depression prevalence. Psychol Med. 2002. doi:10.1017/s0033291702005457.
- Saito M, Iwata N, Kawakami N, Matsuyama Y, Ono Y, Nakane Y, et al. Evaluation of the DSM‐IV and ICD‐10 criteria for depressive disorders in a community population in Japan using item response theory. Int J Meth Psychiatr Res. 2010. doi:10.1002/mpr.320.
- Lo BCY, Zhao Y, Kwok AWY, Chan W, Chan CKY. Evaluation of the psychometric properties of the Asian adolescent depression scale and construction of a short form: an item response theory analysis. Assess. 2015. doi:10.1177/1073191115614393.
- Zimmerman M, Martinez JH, Friedman M, Boerescu DA, Attiullah N, Toba C. How can we use depression severity to guide treatment selection when measures of depression categorize patients differently? J Clin Psychiatry. 2012. doi:10.4088/JCP.12m07775.
- Zimmerman M, Martinez JH, Friedman M, Boerescu DA, Attiullah N, Toba C. Speaking a more consistent language when discussing severe depression: a calibration study of 3 self-report measures of depressive symptoms. J Clin Psychiatry. 2014. doi:10.4088/JCP.13m08458.
- Hambleton RK, Han N. Assessing the fit of IRT models to educational and psychological test data: a five step plan and several graphical displays. In: Lenderking WR, Revicki D, editors. Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications. Washington: Degnon Associates; 2005. p. 57–78.
- Zimmerman M, Martinez JH, Attiullah N, Friedman M, Toba C, Boerescu DA. The remission from depression questionnaire as an outcome measure in the treatment of depression. Depress Anxiety. 2014. doi:10.1002/da.22178.
- Nease DE, Aikens JE, Klinkman MS, Kroenke K, Sen A. Toward a more comprehensive assessment of depression remission: the Remission Evaluation and Mood Inventory Tool (REMIT). Gen Hosp Psychiatry. 2011. doi:10.1016/j.genhosppsych.2011.03.002.
- Aikens JE, Klinkman MS, Sen A, Nease DE. Improving the assessment of depression remission with the Remission Evaluation and Mood Inventory Tool. Int J Psychiatry Med. 2015. doi:10.1177/0091217415612734.
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B. 1995. doi:10.2307/2346101.
- Baker FB. The basics of item response theory. 2001. http://files.eric.ed.gov/fulltext/ED458219.pdf. Accessed 8 Dec 2015.