Comparing five depression measures in depressed Chinese patients using item response theory: an examination of item properties, measurement precision and score comparability

Background Item response theory (IRT) has been increasingly applied to patient-reported outcome (PRO) measures. The purpose of this study is to apply IRT to examine item properties (discrimination and severity of depressive symptoms), measurement precision and score comparability across five depression measures, which is the first study of its kind in the Chinese context. Methods A clinical sample of 207 Hong Kong Chinese outpatients was recruited. Data analyses were performed including classical item analysis, IRT concurrent calibration and IRT true score equating. The IRT assumptions of unidimensionality and local independence were tested respectively using confirmatory factor analysis and chi-square statistics. The IRT linking assumptions of construct similarity, equity and subgroup invariance were also tested. The graded response model was applied to concurrently calibrate all five depression measures in a single IRT run, resulting in the item parameter estimates of these measures being placed onto a single common metric. IRT true score equating was implemented to perform the outcome score linking and construct score concordances so as to link scores from one measure to corresponding scores on another measure for direct comparability. Results Findings suggested that (a) symptoms on depressed mood, suicidality and feeling of worthlessness served as the strongest discriminating indicators, and symptoms concerning suicidality, changes in appetite, depressed mood, feeling of worthlessness and psychomotor agitation or retardation reflected high levels of severity in the clinical sample. (b) The five depression measures contributed to various degrees of measurement precision at varied levels of depression. (c) After outcome score linking was performed across the five measures, the cut-off scores led to either consistent or discrepant diagnoses for depression. 
Conclusions The study provides additional evidence regarding the psychometric properties and clinical utility of the five depression measures, offers methodological contributions to the appropriate use of IRT in PRO measures, and helps elucidate cultural variation in depressive symptomatology. The approach of concurrently calibrating and linking multiple PRO measures can be applied to the assessment of PROs other than the depression context.


Background
With a growing emphasis on patient-centered care, the recent surge in the use of high-quality data from psychometrically sound patient-reported outcome (PRO) measures has engendered the opportunity to use PROs to inform healthcare practices and guide healthcare decision making. In a commissioned paper by the U.S. National Quality Forum on the issues to consider when evaluating PROs as candidate performance measures in healthcare settings, Cella et al. [1] remarked on several methodological issues related to the use of PROs in patient-centered outcome research. One issue focuses on establishing standardized metrics and deriving comparable scores across different PRO measures of the same construct to facilitate direct comparisons between PROs. In addition, the authors highlighted a number of PRO characteristics to consider when selecting appropriate PROs. Measurement precision was among the most important characteristics, as PRO measures with greater measurement precision appear to show greater sensitivity to change [1]. PRO measures not only have great potential to be integrated into healthcare practice but also substantially contribute to elucidating the properties of symptoms directly reported by patients (see for example [2]).
In response to the aforementioned methodological issues, item response theory (IRT) [3] offers promising solutions to address issues that have been difficult to solve through classical methods, and recently, IRT has been increasingly applied to PRO measures. In comparison with classical test theory, IRT offers a number of benefits. First, the application of IRT in examinations of item properties (items can be considered symptoms) adds knowledge regarding the level of severity and discriminating abilities of various symptoms. Such knowledge is of particular clinical interest for assessing symptomatology, as some items may hold higher discriminatory power for differentiating varied levels of clinical latent traits, while other items may reflect more severe symptoms. Second, comparisons from IRT-derived test information functions and their associated standard errors of measurement yield useful information about the contribution of different measures to measurement precision along the latent trait continuum. Clinicians can then determine the most useful and precise measures for assessing a specific level/range of the latent trait in either clinical or epidemiological populations. Third, IRT allows for a common metric on which the item parameters of multiple measures can be placed, and hence, score concordances can be constructed to link scores from one measure to corresponding scores on another measure, in order to facilitate direct comparability across measures. Clinicians can then further investigate whether the conventional cut-off scores on different measures lead to a convergent or divergent solution for clinical and epidemiological decision making.
Major depressive disorder (MDD) is among the significant causes of disease burden worldwide [4]. Regarding the measures of depressive symptomatology, to date, several well-developed and carefully validated PRO measures, such as the Beck Depression Inventory-II (BDI-II) [5], the Center for Epidemiologic Studies Depression Scale (CES-D) [6], the Patient Health Questionnaire (PHQ-9) [7], the depression subscale of the Depression, Anxiety and Stress Scale (DASS-Depression) [8], and the depression subscale of the Hospital Anxiety and Depression Scale (HADS-Depression) [9], have been widely used in research and clinical practice. These instruments have been validated in the Chinese context with proven evidence of sound reliability and validity based primarily on classical test theory [10][11][12][13][14][15][16][17][18]. Under the IRT framework, studies conducted exclusively in Western cultures have offered good examples of comparing and linking multiple depression measures [19][20][21][22][23][24][25]. However, considering the existence of cultural variance in the assessments of depression [26][27][28], whether the aforementioned findings developed in Western populations are applicable to the Chinese context remains unclear. Cultural differences in terms of item endorsement in these commonly used depression inventories have been noted in past studies [26,29]. Dere et al. [26], for example, noted that Canadian university students of Chinese heritage tended to score higher on cognitive items (e.g., past failure, worthlessness) of the BDI-II than their European-heritage counterparts. However, the aforementioned studies compared Caucasian-heritage and Asian-heritage students, and it remains unknown whether the findings can be generalized to native Chinese samples, particularly clinically depressed samples.
In addition, no studies thus far have attempted to apply IRT, a modern measurement technique, to multiple depression measures by examining item properties, measurement precision and score comparability together in the Chinese context. Therefore, the present study attempts to fill this gap by applying IRT to measure depression through an examination of five depression measures (i.e., the BDI-II, CES-D, PHQ-9, DASS-Depression and HADS-Depression) in a clinical sample of depressed Chinese adults. Specifically, the following questions are addressed: (a) What levels of severity and discrimination are associated with the individual depressive symptoms assessed by the five measures? (b) To what extent can each of the five measures contribute to measurement precision in assessments of a full range of underlying depression levels? (c) What is the relationship between the scores from one measure and the corresponding scores from another measure? A clinical sample (N = 207) of Hong Kong Chinese outpatients seeking treatment for mood and anxiety disorders was recruited from local hospitals for this study.

Sample
In the original sample, 207 Hong Kong Chinese outpatients seeking treatment for mood and anxiety disorders in Hong Kong public hospitals were invited to participate in the study. Those who were suffering from psychotic or developmental disorders at the time of testing were excluded. The sample comprised 42 males (20.3%) and 165 females (79.7%) ranging in age from 19 to 69 years (M = 45.7 years, SD = 10.8). Detailed sample characteristics are reported in Table 1. All 207 respondents (100%) completed the BDI-II, the DASS-Depression, and the HADS-Depression; 204 (98.6%) completed the PHQ-9; and 199 (96.1%) completed the CES-D. No data on the completed measures were missing.

Measures (Diagnostic interview and self-report questionnaires including the BDI-II, CES-D, PHQ-9, DASS, and HADS)
The Structured Clinical Interview for DSM-IV-TR Axis I Disorders (SCID) [30] was administered to screen depressed patients. The 21-item BDI-II [5] was designed to assess cognitive, behavioral and somatic symptoms of depression. The CES-D is a 20-item measure designed to assess depressive symptoms in epidemiological studies focusing on the affective component of depression [6,31,32]. As a screening and diagnostic tool, the PHQ-9 is a nine-item instrument designed for use in primary care [7], on the basis of the criteria for MDD in the Diagnostic and Statistical Manual of Mental Disorders-Fourth Edition (DSM-IV) [33]. The 21-item DASS was designed to measure three related negative emotional states: depression, anxiety and tension/stress [8]. The HADS was developed to assess anxiety and depression in medical patients [9] with the exclusion of somatic symptoms (e.g., sleep disturbance) in order to avoid confounding psychological symptoms with disease or treatment. The Chinese versions of these measures have demonstrated sound reliability and validity for use with Chinese populations [10,11,13-16,18].

Procedure
Participants were tested individually upon providing written consent. They were invited to complete the SCID and a series of self-report depression and anxiety measurement instruments. Ethics approval was obtained from the Joint Institutional Review Board of the University of Hong Kong-Hospital Authority Hong Kong West Cluster and the Joint Chinese University of Hong Kong-New Territories East Cluster Clinical Research Ethics Committee.

Statistical analysis
Classical item analysis
Prior to fitting the IRT model, we performed classical item analysis to examine the item quality and determine the IRT model selection. At the item level, frequencies for each response category (ranging from 0 to 3), means, standard deviations and item-total correlations were computed. At the scale level, means and standard deviations of observed summed scores and Cronbach's alpha values were calculated. A broad range of item-total correlations indicates the need for a discrimination parameter when an IRT model is selected.

IRT assumption checking
We tested two IRT assumptions: unidimensionality and local independence. As evidence of essential unidimensionality, a value of 4 for the ratio of the first to the second eigenvalue is generally accepted as supporting unidimensionality [34]. Further, a single-factor confirmatory factor analysis (CFA) model was employed based on polychoric correlations with a weighted least squares estimation using Mplus 6 [35]. A single-factor CFA model was run on each measure independently to provide evidence of validity based on the internal structure. As we planned our IRT concurrent calibration on the combined item set comprising all five measures, we also performed a CFA on the combined dataset. A good fit of the single-factor solution supports the unidimensionality assumption. Adequate fit is generally indicated by a comparative fit index (CFI) value above .90, a Tucker Lewis index (TLI) value above .90, and a root mean square error of approximation (RMSEA) value below .10, while very good fit is typically indicated by a CFI value above .95, a TLI value above .95 and an RMSEA value below .05 [36][37][38][39]. Next, we assessed local dependence between item pairs by using Chen and Thissen's chi-square local dependence statistics (LD χ²) [40] in IRTPRO [41]. An LD χ² value of 10 or greater [40,42] indicates local dependence.
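The eigenvalue-ratio and fit-index rules of thumb described above can be written as a small check. The following is a minimal sketch, not the Mplus procedure used in the study: the function names and the "below adequate" label are illustrative assumptions, and the eigenvalues are assumed to have been extracted already from an inter-item correlation matrix.

```python
def eigenvalue_ratio(eigenvalues):
    """Ratio of the largest to the second-largest eigenvalue."""
    ordered = sorted(eigenvalues, reverse=True)
    if len(ordered) < 2 or ordered[1] <= 0:
        raise ValueError("need at least two eigenvalues with a positive second value")
    return ordered[0] / ordered[1]


def classify_cfa_fit(cfi, tli, rmsea):
    """Label model fit using the cut-offs quoted in the text.

    The label "below adequate" is an assumption for illustration only.
    """
    if cfi > 0.95 and tli > 0.95 and rmsea < 0.05:
        return "very good"
    if cfi > 0.90 and tli > 0.90 and rmsea < 0.10:
        return "adequate"
    return "below adequate"


# Hypothetical values, not taken from the study.
ratio = eigenvalue_ratio([1.2, 6.0, 0.8])
print(ratio >= 4, classify_cfa_fit(0.93, 0.92, 0.08))
```

Applied strictly, these thresholds place a model in the "very good" band only when all three indices clear their stricter cut-offs; in practice, borderline values are usually interpreted with some judgment.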

IRT concurrent calibration and goodness-of-fit assessment
The combined item set comprising the five measures was concurrently calibrated in a single IRT run by using the graded response model (GRM) [43] in MULTILOG 7.03 [44] so that the item parameter estimates were placed onto a single common metric. Further, we checked the standard errors (SEs) of the item parameter estimates to ensure that the GRM was well estimated. Average SE values for item parameters between .20 and .35 indicate good estimates [45]. Additionally, we evaluated the degree of fit between the IRT model and the data by using Orlando and Thissen's summed-score item-fit statistics (S-X²) [46]. A nonsignificant result indicates adequate model fit.
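For readers less familiar with the graded response model, the sketch below shows how category probabilities follow from a discrimination parameter and a set of ordered thresholds. It assumes the logistic metric without a scaling constant, and the parameter values are hypothetical; it illustrates the model form only and is not the estimation routine run in the calibration software.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Samejima's GRM: P(X = k | theta) for k = 0..len(thresholds).

    `thresholds` must be in increasing order (b_1 < b_2 < ...).
    """
    # Cumulative probabilities P(X >= k), with P(X >= 0) = 1
    # and P(X >= K + 1) = 0 as boundary values.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum.append(0.0)
    # Each category probability is the difference of adjacent cumulatives.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical four-category item (scores 0-3).
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```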
Outcome score linking and score concordances construction
Linking secures the comparability of scores across different measures and typically consists of three steps: (a) selecting a data collection design, (b) placing parameter estimates on a common metric, and (c) linking test scores. A single-group design in which each respondent was administered all five depression instruments was adopted. Concurrent calibration was performed to place parameter estimates on a common metric. IRT true score equating [47] was implemented in POLYEQUATE [48] to perform the outcome score linking and construct score concordances in order to transfer every possible summed score to a corresponding IRT-derived θ score and associate the summed scores across the five measures. Before performing the linking, we tested the linking assumptions of construct similarity, equity and subgroup invariance [49].
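The logic of IRT true score equating can be sketched as follows: a summed score on one measure is mapped to the θ at which that measure's expected (true) score equals it, and the expected score on a second measure is then evaluated at the same θ. This is a minimal illustration under the GRM with hypothetical item parameters and a simple bisection search; it is not a reimplementation of the equating software used in the study.

```python
import math


def expected_item_score(theta, a, thresholds):
    # For a GRM item, the expected score equals the sum of the
    # cumulative probabilities P(X >= k) over k = 1..K.
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)


def true_score(theta, items):
    """Test characteristic curve: expected summed score at theta."""
    return sum(expected_item_score(theta, a, bs) for a, bs in items)


def theta_for_score(score, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Invert the test characteristic curve by bisection."""
    if not (true_score(lo, items) < score < true_score(hi, items)):
        raise ValueError("score is outside the range reachable on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def link_score(score_x, items_x, items_y):
    """Return (theta, equivalent true score on measure Y)."""
    theta = theta_for_score(score_x, items_x)
    return theta, true_score(theta, items_y)


# Hypothetical parameters: (discrimination, ordered thresholds) per item.
measure_x = [(1.5, [-1.0, 0.0, 1.0]), (2.0, [-0.5, 0.5, 1.5])]
measure_y = [(1.2, [-1.5, -0.2, 0.9]), (1.8, [0.0, 0.8, 1.6]), (0.9, [-0.5, 0.5, 2.0])]
theta, y_equiv = link_score(3.0, measure_x, measure_y)
print(round(theta, 2), round(y_equiv, 2))
```

Because both measures sit on one θ metric after concurrent calibration, no separate scale transformation is needed before evaluating the second test characteristic curve. Summed scores of zero and the maximum have no finite θ and need separate handling, which this sketch simply rejects.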

Classical item analysis
The wide range of observed summed scores for each measure (Table 2) ensured good coverage of the whole spectrum of depression levels ranging from low to high. Cronbach's alpha values (ranging from .82 to .92) across the five measures and the overall alpha for the combined item set (α = .98) indicated high reliability. The variety of item-total correlations on the combined item set (ranging from .21 to .81) suggested that an IRT model accounting for the heterogeneity in discrimination parameters was necessary.
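The two classical statistics reported here can be computed as in the sketch below. The data are hypothetical, and the item-total correlation is computed in its corrected form (each item against the sum of the remaining items), which is an assumption, since the text does not state which variant was used.

```python
import math
from statistics import mean, pvariance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


def pearson(x, y):
    # Undefined (division by zero) if either variable has zero variance.
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(rows):
    """Correlation of each item with the sum of all other items."""
    k = len(rows[0])
    result = []
    for j in range(k):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]
        result.append(pearson(item, rest))
    return result


# Hypothetical responses (0-3 scale), five respondents by four items.
data = [[0, 1, 0, 1], [1, 1, 2, 1], [2, 3, 2, 2], [3, 2, 3, 3], [1, 0, 1, 2]]
print(round(cronbach_alpha(data), 3))
print([round(r, 2) for r in corrected_item_total(data)])
```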

IRT assumption checking
For each depression measure and the combined item set, the ratio of the first to the second eigenvalues considerably exceeded 4. From the CFA, the fit statistics suggested either adequate or very good fit depending on the fit statistics referenced (Table 2). Notably, for the combined item set for which the IRT calibration was planned, the fit statistics showed very good fit (CFI = 0.95, RMSEA = 0.051, TLI = 0.95). All these results lend support to the essential unidimensionality assumption.
Local independence largely held, with the exception of one item pair: the BDI-II item "Crying" and the CES-D item "I had crying spells" exhibited an LD χ² value slightly higher than 10 (χ² = 10.4), likely because the items were similar in content.
Considering that the data were essentially unidimensional and that almost all item pairs were locally independent, we considered that the data were suitable for IRT calibration and thus proceeded with the IRT analysis.

Evaluation of linking assumptions
The linking assumptions of construct similarity, equity and subgroup invariance were tested for the appropriateness of linking. To ensure that the five scales essentially measure the same or similar underlying constructs, we considered the single-factor solution from the CFA and the high level of internal consistency from Cronbach's alpha on the combined item set (α = .98) to be supporting evidence of construct similarity. To ensure that the scores of the five measures to be linked were highly correlated for the equity assumption, we computed correlations (ranging from .73 to .85) and disattenuated correlations (ranging from .85 to .93) between the pairwise observed scale scores (Table 2), which indicated that the five measures were strongly correlated. In terms of the subgroup invariance assumption, the same function relating IRT-derived θ scores and summed scores generally held across gender groups, providing support for the subgroup invariance assumption.
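A disattenuated correlation corrects an observed correlation for unreliability in both scales by dividing it by the square root of the product of the two reliabilities. The sketch below assumes that a reliability estimate such as Cronbach's alpha is used for each scale, which the text does not specify, and the input values are hypothetical.

```python
import math


def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation."""
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical observed correlation and reliabilities.
print(round(disattenuated_correlation(0.75, 0.85, 0.90), 3))
```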

IRT concurrent calibration and goodness-of-fit assessment
Evaluation of estimation accuracy and model-data fit
Although the sample was of moderate size (207 participants), the average SEs for item parameters ranged between .20 and .30 (Table 3), demonstrating that the IRT model was well estimated. It suggested that acceptable estimation accuracy was largely achieved in this IRT calibration.
Nine items were reported to show a lack of fit, while good fit was indicated for the rest of the items (Table 3). We further examined the consequence of item misfit on the item and person parameter estimates and found that either including or excluding the nine items yielded nearly identical results. Therefore, as we considered the consequence minor and the misfit tolerable [50], we included all items in the outcome score linking.

Comparison of item properties across the five depression measures
The item discrimination (a) parameters (Table 3) across the five measures ranged in value from 0.36 to 3.43 (M = 1.73, SD = 0.62). Notably, items addressing depressed mood, suicidality and feelings of worthlessness provided the strongest discriminating indicators; thus, they were the most useful for discriminating among respondents with varied levels of depression. The second most discriminating set of indicators included items on fatigue or loss of energy, psychomotor agitation or retardation, and concentration difficulties. The moderately discriminating set of indicators contained items on changes in sleep and changes in appetite. CES-D items on positive affect (e.g., "I am just as good as other people", "I felt good about the future" and "I was happy") had the weakest ability to distinguish respondents with varied depression levels and thus added the least information to the depression measurement. Of additional interest was the great variation in the discriminating abilities of items on loss of interest (a parameter estimates ranging from 0.87 to 2.94).
In terms of item severity (Table 3), symptoms pertaining to suicidality, changes in appetite, depressed mood, feelings of worthlessness and psychomotor agitation or retardation were associated with high levels of severity. Items on concentration difficulties, fatigue or loss of energy and loss of interest, followed by problems related to changes in sleep, were associated with moderate levels of severity. All four CES-D items on positive affect were associated with the lowest levels of severity.
With respect to item information, among the items with similar a values, the level of precision/usefulness for assessing depression differed along the θ continuum. For instance, between DASS-Depression items "Felt down-hearted and blue" (a = 3.11) and "No positive feeling at all" (a = 3.00), the former was more useful for differentiating respondents with depression levels along θ < −0.8 and 0 < θ < 1, and the latter was more informative for discriminating respondents with depression levels along −0.8 < θ < 0 (Fig. 1).
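Two items with nearly equal discrimination can still peak at different regions of θ because their thresholds differ. The sketch below computes the GRM item information function on the logistic metric; the parameter values are hypothetical and are not the estimates in Table 3.

```python
import math


def grm_item_information(theta, a, thresholds):
    """Item information under Samejima's GRM.

    I(theta) = sum_k [dP_k/dtheta]^2 / P_k, where P_k is the
    probability of responding in category k.
    """
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum.append(0.0)
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]
        # Derivative of each cumulative curve is a * P * (1 - P).
        slope = a * (cum[k] * (1.0 - cum[k]) - cum[k + 1] * (1.0 - cum[k + 1]))
        if p_k > 0.0:
            info += slope ** 2 / p_k
    return info


# Two hypothetical items with similar a but shifted thresholds.
item_1 = (3.1, [-1.2, 0.3, 1.0])
item_2 = (3.0, [-0.6, -0.1, 1.4])
for theta in (-1.0, -0.4, 0.5):
    print(theta,
          round(grm_item_information(theta, *item_1), 2),
          round(grm_item_information(theta, *item_2), 2))
```

Summing item information over the items of a scale gives that scale's test information function.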
Outcome score linking and score concordances construction
Comparison of cut-off theta scores across the five depression measures
Each (observed) summed score for each measure was converted to an IRT-derived θ (theta) score. The score concordances at cut-off scores are reported in Table 4. For instance, in the 20-item CES-D, a summed score of 16, the cut-off point for identifying respondents as being at risk for depression, corresponded to a θ score of −0.95, indicating that the cut-off score of 16 distinguished people with a θ score above −0.95 from those with a θ score below −0.95.

Comparison of cut-off summed scores across the five depression measures
In the same score concordances, each cut-off (observed) summed score for each measure was associated with an (observed) summed score for each of the other four measures (Table 4). Notably, the resulting cut-off scores across the five measures led to either a consistent or discrepant diagnosis for depression. For instance, the cut-off scores for mild depression on the BDI-II and the PHQ-9 were equivalent to each other, whereas the cut-off score for moderate depression on the BDI-II corresponded to the cut-off score for mild depression on the HADS-Depression.

Comparison of measurement precision across the five depression measures
To place the measurement precision of the five measures on a familiar scale, a test information value of approximately 10 corresponds to a conventional reliability of .90 as derived from classical test theory [51]. As shown in Fig. 2, both the BDI-II and the CES-D were informative on a wider range of depression levels, and they exhibited greater measurement precision than the other three measures. The BDI-II was more useful for differentiating depression levels for θ scores approximately between −1 and 2.3 (normal to severe depression), and the CES-D was more informative for discriminating respondents with depression levels along θ scores from approximately −1.5 to 2.0. The PHQ-9 offered great potential in assessing depression levels along θ scores from approximately −0.7 to 1.7 (mild to severe depression). The DASS-Depression was informative for assessing depression levels along the θ continuum between −0.3 and 1.3 (moderate to extremely severe depression). Among the five measures, the HADS-Depression was the least informative for assessing varied depression levels, and its maximum test information was roughly equivalent to a conventional reliability of .78.
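The correspondence between test information and conventional reliability used above follows from reliability = 1 − 1/I(θ), which assumes that θ is scaled to unit variance in the population. A minimal sketch of the conversion in both directions, with illustrative function names:

```python
import math


def reliability_from_information(info):
    """Conventional reliability implied by test information (unit-variance theta)."""
    if info <= 0.0:
        raise ValueError("information must be positive")
    return 1.0 - 1.0 / info


def sem_from_information(info):
    """Standard error of measurement on the theta metric."""
    if info <= 0.0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(info)


def information_from_reliability(reliability):
    """Test information implied by a given reliability below 1."""
    if not (0.0 <= reliability < 1.0):
        raise ValueError("reliability must lie in [0, 1)")
    return 1.0 / (1.0 - reliability)


print(round(reliability_from_information(10.0), 2))   # information of 10
print(round(information_from_reliability(0.78), 2))   # reliability of .78
```

Under this relation, a reliability of .78 corresponds to a test information value of roughly 4.5.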

Discussion
This is the first study, in the Chinese context, to utilize an IRT approach to the measurement of depression through an examination of five depression measures simultaneously, namely, the BDI-II, CES-D, PHQ-9, DASS-Depression and HADS-Depression.

Psychometric properties and clinical utility of the five depression measures
The work presented herein significantly contributes to knowledge on depression measurement in the Chinese context. First, the findings from this study demonstrated that the five depression measures had sound reliability and validity for depressed Chinese adults. Our findings join previous studies [10, 11, 13-16, 18, 52-54] in providing supporting evidence of the psychometric properties of these instruments in the same context. Notably, the reverse-scored CES-D items measuring positive affect (e.g., "I am just as good as other people", "I felt good about the future") were found to be the least discriminating and to reflect the least severe symptoms; thus, they added little to the measurement precision of depression assessments in the studied context. Our findings echo the work of Iwata et al. [55], who suggested that the CES-D positive affect items with positive wording cannot adequately assess depressive disorders in the Japanese population. This observation across cultures prompts a broader rethinking of the role of these instruments in guiding treatment decisions. In determining treatment outcomes, remission is traditionally defined by substantial (or complete) alleviation of depressive symptoms. In the absence of apparent biological state markers for major depression, monitoring of recovery progress can only be defined phenomenologically, oftentimes by comparing patients' symptom severity with a predetermined diagnostic threshold or clinical cut-off scores in these well-validated depressive inventories [56]. These conceptions, however, have been challenged by recent research advocating a broadening of the concept of remission beyond symptom resolution (e.g., [57,58]).
Table 4 Score concordances at cut-off scores of BDI-II, CES-D, PHQ-9, DASS-Depression, and HADS-Depression
The new proposal holds that multiple domains, including, for example, subjectively perceived functional improvement and quality of life, should also be taken into account if a holistic, patient-centered metric of recovery is sought. In light of this, more comprehensive depression instruments, such as the Remission from Depression Questionnaire (RDQ), have been developed [59]. From a culturally sensitive perspective, the present findings point to the importance of incorporating these person-centered instruments in addition to standardized depression symptom severity scales, especially when the information is to be used to guide treatment decisions in practice. This is because the benchmark of specific item endorsement on a symptom severity scale may be culturally dependent, and patients' perspectives on remission status may provide collateral information that helps tailor efficacious treatment planning to individual needs. Second, our findings help elucidate cultural variations in depressive symptomatology. Symptoms pertaining to psychologization, such as depressed mood, suicidality and feelings of worthlessness, served as the strongest discriminating indicators, while symptoms pertaining to somatization, such as psychomotor agitation, fatigue or loss of energy, concentration difficulties, changes in sleep and changes in appetite, were found to exhibit highly or moderately discriminating abilities. In terms of severity, symptoms related to suicidality, changes in appetite, depressed mood, and feelings of worthlessness appeared to reflect a high level of severity in the Chinese clinical sample. The findings of the present study share some consistencies with those from previous studies. For instance, suicidality and changes in appetite also emerged at a high level of severity in Western contexts [2,60]. However, discrepancies do exist.
The symptom of feelings of worthlessness was ranked at a relatively low level of severity in the Western context [2], while the same symptom appeared to be rated at a relatively high level of severity by the Chinese outpatients in our study. Similarly, the high level of severity and high discriminating ability of feelings of worthlessness in our findings are in accordance with Saito et al.'s work conducted on a Japanese community sample [61]. Intriguingly, recent work also showed that the cognitive component of negative self-evaluation is an important factor that differentiates reports of depressive symptomatology between Asian and Western youths [62]. The salience of a heightened sense of self-worth may be related to a deep-rooted Confucian value among Asian Chinese, where a person's intrinsic value is highly dependent on how well the person meets social expectations in serving the collective interest of the social group. Furthermore, a loss of functioning resulting from depression, especially in its severe form, may bring about intense shameful feelings and
Intriguingly, a closely related observation is that several items that demonstrated misfit seemed to be associated with a systematic symptom theme. For example, the items "Guilty feeling" and "Punishment feeling" in the BDI-II; the items "I am just as good as other people" and "I felt good about the future" in the CES-D; and the item "Feeling bad about oneself" in the PHQ-9 all loaded onto the same "Feelings of hopelessness" (FH) theme. These items reflect a strong sense of responsibility and echo the cultural belief that a person's value should be closely linked with the social roles that one is expected to perform in collectivistic societies. It would be interesting to test in future studies whether the same pattern of misfitting items is observed in individualistic cultures.
Third, the findings on the item and test information offer valuable information regarding how reliably and precisely each item/symptom and each measure assesses depression at varied levels. Though they may share similarities in discrimination parameters, items may vary in precision/usefulness for assessing varied levels of depression. For instance, between two DASS-Depression items with similar a values, "No positive feeling at all" was more useful for assessing mild and moderate depression, whereas "Felt down-hearted and blue" was more informative for assessing moderate to extremely severe depression. The finding in this example helps us better understand the gradient of affective dysregulation experienced by sufferers of depression and suggests that a loss of positive affect may precede, or interactively exacerbate, the experience of intensive negative affect in the course of depression.
At the scale level, the findings showed that the five depression measures contributed in various degrees to measurement precision along the full range of the underlying depression levels, providing insight into instrument selection. Specifically, in the studied context, the BDI-II and the CES-D were informative on a wider range of depression levels and had greater measurement precision than the other three measures. The PHQ-9 and the DASS-Depression were particularly useful for assessing depression in clinical populations, as the former was informative for measuring depression ranging from mild to severe and the latter was informative for assessing depression ranging from moderate to extremely severe. Accordingly, clinicians can choose the measure that is the most useful/precise for assessing a specific level of depressive severity at the patient level in either clinical or epidemiological populations. Notably, the HADS-Depression appeared to be the least informative for assessing depression in the Chinese context, based on the observation that moderate or low discrimination parameter estimates were reported on the majority of items in this scale.
Our pattern of score concordance results echoes previous studies in suggesting that commonly used depression scales differ in how they classify depression severity. Zimmerman and colleagues [63,64], for example, administered the Hamilton Depression Rating Scale (HRDS), the PHQ-9, the Clinically Useful Depression Outcome Scale (CUDOS) and the Quick Inventory of Depressive Symptomatology (QIDS) to a group of clinically depressed patients and compared the diagnostic outcomes as indicated by the reported scores in each case. The authors noted significant variance in the distribution of patients classified into discrete severity categories when different scales were used. The level of disagreement implied that treatment planning based solely on data collected from a single self-report scale may be over-inclusive, even though these scales are all well validated and standardized.
Finally, following from the previous point, the clinical value of the score concordances reported herein is worth highlighting. With scores obtained from the administration of one depression measure, one can use the concordance table to locate the corresponding scores on other depression measures without administering them. Clinicians can then determine depression diagnoses for individual respondents on the basis of the cut-off scores for these rating scales and other interview-based assessments. Further, scores across the five measures are not only aligned with each other in the observed score metric but also mapped to the IRT scores on the θ metric. Such mapping gives clinical meaning to the otherwise arbitrary θ metric. For instance, respondents who scored 0.47 or above (on the θ metric) on the BDI-II are likely to be diagnosed as severely depressed. Clinicians can then refer to the item information function curves to locate the symptoms that are more informative for assessing this restricted range of severe depression.

Advantages of the methodology
The methodology used in this study has several notable strengths. First, we followed a single-group design for the outcome score linking. Such a design directly controls for differences in response propensities because the instruments are administered to the same respondents [48]. Additionally, we used concurrent calibration, which is less time-consuming and produces more stable results than separate calibration [48]. Second, we tested the linking assumptions. Such a practice deserves more attention, and it is strongly encouraged in studies on linking PRO measures to ensure the validity of the inferences drawn from the score concordances. Finally, instead of relying solely on chi-square-like IRT fit statistics, which can be sensitive to sample size, we evaluated IRT item misfit by focusing on the consequences of using misfitting items and the item statistics associated with them, a strategy strongly recommended by Hambleton and Han [65] and Zhao [50]. We encourage future studies to adopt a similarly rigorous approach to methodological issues in order to promote the quality of PRO research and to ensure the appropriate application of IRT models.

Limitations and future directions
The major limitations and future directions of the present study are discussed below. First, a convenience sampling approach was used to recruit participants because of practical restrictions, which limits the representativeness of the sample and the generalizability of the results. A related issue is the unbalanced gender ratio, which limits the power of statistical tests to examine gender differences. Second, the outcome score linking function/relationship established in the study may be sensitive to population differences [49], and only one linking approach was used in this study. It would seem prudent to evaluate the robustness of the linking relationship across different samples (e.g., in Chinese nonclinical samples) and across multiple linking approaches (e.g., both IRT-based and non-IRT-based approaches). Additionally, whether the invariance of item parameters holds across clinical and nonclinical populations also requires further investigation. With additional sets of larger clinical and epidemiological samples, a more robust item bank and score concordances can be established. Third, the present study did not incorporate other patient-centered instruments assessing perceived remission status for comparison purposes. As mentioned previously, these patient-centered instruments are informative in defining depression remission with reference to symptom severity, and it would be useful to take them into account and to explore their potential merits. Future studies could consider including, for example, the Remission from Depression Questionnaire [58,66] and/or the Remission Evaluation and Mood Inventory Tool [67,68]. Furthermore, it would be useful to conduct follow-up studies with large samples to cross-reference the depression scales with interview-based clinical diagnostic tools relating to depressive symptomatology.
Finally, the five depression measures covered in the study have all been developed in Western cultures, although the Chinese versions of these measures have been demonstrated to have sound psychometric properties. Nonetheless, the cut-off scores for depression diagnosis that have been suggested based on the Western context deserve further validation in the Eastern context.