The Chinese version of the Perceived Stress Questionnaire: development and validation amongst medical students and workers

Background A valid and efficient stress measure is important for clinical and community settings. The objectives of this study were to translate the English version of the Perceived Stress Questionnaire (PSQ) into Chinese and to assess the psychometric properties of the Chinese version of the PSQ (C-PSQ). The C-PSQ evaluates subjective experiences of stress instead of a specific and objective status. Methods Forward translations and back translations were used to translate the PSQ into Chinese. We used the C-PSQ to survey 2798 medical students and workers at three study sites in China from 2015 to 2017. Applying Rasch analysis (RA) and factor analysis (FA), we examined the measurement properties of the C-PSQ. Data were analyzed using the Rasch model for item fit, local dependence (LD), differential item functioning (DIF), unidimensionality, separation and reliability, response forms and person-item map. We first optimized the item selection in the Chinese version to maximize its psychometric quality. Second, we used cross-validation, by exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), to determine the best fitting model in comparison to the different variants. Measurement invariance (MI) was tested using multi-group CFA across subgroups (medical students vs. medical workers). We evaluated validity of the C-PSQ using the criterion instruments, such as the Chinese version of the Perceived Stress Scale (PSS-10), the Short Form-8 Health Survey (SF-8) and the Goldberg Anxiety and Depression Scale (GADS). Reliability was assessed using internal consistency (Cronbach’s alpha, Guttman’s lambda-2, and McDonald’s omegas) and reproducibility (test–retest correlation and intraclass correlation coefficient, [ICC]). Results Infit and/or outfit values indicated that all items fitted the Rasch model. Three item pairs presented local dependency (residual correlations > 0.30). Ten items showed DIF. Dimensionality instruction suggested that eight items should be deleted. One item showed low discrimination. Thirteen items from the original PSQ were retained in the C-PSQ adaptation (i.e. C-PSQ-13). We tested and verified four feasible models to perform EFA. Built on the EFA models, the optimal CFA model included two first-order factors (i.e. constraint and imbalance) and a second-order factor (i.e., perceived stress). The first-order model had acceptable goodness of fit (Normed Chi-square = 8.489, TLI = 0.957, CFI = 0.965, WRMR = 1.637, RMSEA [90% CI] = 0.078 [0.072, 0.084]). The second-order model showed identical model fit. Person separation index (PSI) and person reliability (PR) were 2.42 and 0.85, respectively. Response forms were adequate, item difficulty matched respondents’ ability levels, and unidimensionality was found in the two factors. Multi-group CFA showed validity of the optimal model. Concurrent validity of the C-PSQ-13 was 0.777, − 0.595 and 0.584 (Spearman correlation, P < 0.001, the same hereinafter) for the Chinese version of the PSS-10, SF-8, and GADS. For reliability analyses, internal consistency of the C-PSQ-13 was 0.878 (Cronbach’s alpha), 0.880 (Guttman’s lambda-2), and 0.880 (McDonald’s omegas); test–retest correlation and ICC were 0.782 and 0.805 in a 2-day interval, respectively. Conclusion The C-PSQ-13 shows good metric characteristics for most indicators, which could contribute to stress research given its validity and economy. This study also contributes to the evidence based regarding between-group factorial structure analysis.


Introduction
Stress has been as an old and a pivotal concept, but no commonly accepted definition of the term, in the health research since it is associated with various health outcomes and quality of life. Three prevailing approaches have been used by researchers to assess different aspects of this construct. Previous study concerning Selye's response-based stress model assuming that events themselves act as the causal agent behind pathology, illness, cognitive impairment, maladaptive behavior, and other unhealthy outcomes; this model focuses on the assessment of the activation of specific physiological systems that are involved in the stress response [1,2]. The stimulus model of stress, by comparison, emphasizes on the measurement of stressors in terms of environmental conditions (i.e. environmental stressors or stimuli) [3]. The transactional model of stress concentrates on the evaluation of the degree and type of the challenge, threat, harm, or loss, as well as on the individual's perceived abilities to cope with such stressors [4]; the view to support this model implies, further, that stress is not the product of an imbalance between objective demands and response capacity, but rather of the perception of these factors [5,6]. Although recognition around this general conceptualization over time, from which the construct of "perceived stress" arisen [7], the critical constructs underlying perceived stress have been more complex and challenging to evaluate.
As regards the measurement of stress, there is no clear consensus as to what the criteria should be for referral to measuring stressors in the case of objective conditions, including, but not limited to: (a) major life events and daily hassles (cumulative minor stressors) [3,8,9], (b) stress appraisal (perceptional processing) and/or emotional response [7,10], (c) the coping and perceptions of control [11]. Indeed, the coping can be seen as a process, a strategy, and a response to all the elements (e.g., environment, individual disposition) that play a role in the effort to adapt [12]. No matter what kind of evaluation system, there are obvious drawbacks that limit their usefulness in past research.
Summers up the results of empirical research, accumulated or chronic stress has an adverse impact on mental well-being and physical health, whereas an important concern is that acute and temporally life events could not predict illness to the same extent [13], and what's more, life events do not predict symptoms [14]. In addition, the personal impact of life events cannot be ascertained before the event actually occurred [15]. Recent stress research suggests that minor, chronic, daily stressors may be more important in determining outcomes than major life events [16]. Other approaches to measuring stress have diverted the focus from specific objective stressors to even more chronic and stress experiences independent of concrete objective occasion, known as a "subjectively experienced stress" [17]. Admittedly, inclination towards assessment of stress appraisal rather than stressful life event itself has since been targeted; more emphasis has been given to the development of stress measurement instruments that focused mainly on the subjective perception of the individual [7,[17][18][19][20].
Perceived stress is the feelings or thoughts that an individual has about how much stress they are experiencing at a given time or over a time period span, which reflects the interaction between an individual and environment [21]. Under such background, as an alternative instrument for assessing the perception of stress, studies increasingly have used the Perceived Stress Questionnaire (PSQ) of developers Levenstein and coworkers [22]. To understand the dimensionality of perceived stress, it has been aimed at overcoming some of the difficulties concerning the definition and selecting items tapping potential cognitive, emotional, and symptomatic sequelae of stressful events and circumstances, which tend to trigger or exacerbate disease symptoms [2,23,24]. The PSQ is specifically recommended for clinical settings, especially in psychosomatic medicine, though it has been employed in research studies as well. Similarly, another measuring stress perception tool, the Perceived Stress Scale (PSS) of developers Cohen and colleagues [7,18], belongs to the most common instrument of this field in the literature. The original instrument (English) includes 14 items, and other forms have been evolved for 10-and 4-item subsets of the PSS over time; and it is currently translated into over 30 languages in accordance with Laboratory for the Study of Stress, Immunity, and Disease (Retrieved from: https://www.cmu. edu/dietrich/psychology/stress-immunity-disease-lab/index. html). The PSS items assess the extent to which respondents find their lives has been unpredictable, uncontrollable, and overloaded during the previous month. Moreover cannot but raise, what differs from the PSQ to the PSS is the specific nature of dimensions and elements, the former viewed affect and psychosomatic conditions as triggers of subsequent symptomatology and reflective of perceived stress, rather than as symptoms themselves, whereas the latter concerned about cognitive appraisal of stress and the respondent's perceived control and coping capability [2,18,22,23]. Again, both the PSQ and the PSS have been found to predict many psycho-physiological (psychological or physiological) outcomes that one would expect to follow from stress [25][26][27][28][29][30][31][32][33][34]. Accumulating research, expectantly, will continue and accelerate to focus on perceived stress in relation to health and disease over the upcoming years.
Other than the source language (English and Italian), there are multiple language versions of the PSQ currently, namely Swedish [35][36][37], Greek [38], German [23,39], Spanish [40], Thai [41], Norwegian [25], French [42], Arabic [43] and Chinese [44]. Review of the literature suggests that in various cultures and countries, some of them provide relatively complete the psychometric properties, and others brief and incomplete, whereas the latter greater emphasis on clinical application. This tool contains two alternative forms, the General PSQ and the Recent PSQ, based upon respondent's feelings and thoughts in a given time range, during the last two years or during the last month, respectively. The original PSQ has 30 items that distribute seven dimensions: harassment, overload, irritability, lack of joy, fatigue, worries and tension [22]. The Chinese version of the PSQ (C-PSQ) was tested only in nursing students in China, apart from some indicators of psychometric still existed with insufficiency [44]. Furthermore, longer questionnaires result in higher data collection costs and greater respondent burden and may lead to lower response rates and diminished quality of response [45]. Recent findings have suggested that the original PSQ in routine use could lead to respondent burden and has item redundancy [23,37]. Specifically, the C-PSQ-30 likewise also needs to be parsimonious in order to keep the length of this scale as short as possible. As such, following previous research, this study examines two or more samples to evaluate the psychometric properties using Rasch analysis, factor analysis and other statistics methods through a psychologically comprehensive measurement.

Measures
Perceived stress questionnaire (PSQ) The PSQ was translated into Chinese using forward translations and back translations based on an integrated method and these guidelines [46][47][48], as described below: Stage 1: Initial translation; two bilingual translators independently translated the original PSQ (English) into simplified Chinese.
Stage 2: Reconcile and synthesis of the translations; the researchers invite two translators and community experts (bicultural and bilingual individuals) to reconcile and synthesize the translations.
Stage 3: Back translation; using the synthetic version of the instrument from stage 2, another two bilingual translators separately translated it into English.
Stage 4: Expert committee; the ten-member expert panel and the original developer of the PSQ did review all the translations, reach a consensus on any discrepancy, and develop the pre-final version.
Stage 5: Pre-testing; during the internship, nine nursing students at the hospital participated pretest. Each student kindly completed the questionnaire (pre-final version). We, too, closely interviewed these participants to guarantee that there were no unintelligible or ambiguous questions. Finally, the final Chinese version of the Perceived Stress Questionnaire (C-PSQ) has been finalized.
Additionally, we emailed the final version to consult with Dr. Susan Levenstein to ensure that the two versions were equivalent in four levels: semantic, idiomatic, experiential and conceptual [48].
The C-PSQ is consistent with the original version of the PSQ (English) both in item order and scoring method, rating each item with reference to frequency of occurrence on a four-point Likert scale (1: almost never, 2: sometimes, 3: often, and 4: usually). Eight items (1,7,10,13,17,21,25,29) need to be reverse scored. The PSQ index is calculated as (raw score -30)/90, i.e. (raw score -the lowest possible score)/ (the highest possible score -the lowest possible score), which ranges from 0 to 1, with higher values indicating greater level of perceived stress.

Perceived stress scale (PSS)
As the PSS is short and easy to complete, it can be used together with other measures [49], thereby being selected as the criterion. Meanwhile, among three forms (number of items) of the PSS, it is recommended that the PSS-10 be used to measure perceived stress, both in practice and research [34,50]. Given that the Simplified Chinese version of the PSS-10 (C-PSS-10) gained Dr. Cohens' recognition [51], this form of the PSS was chosen in this survey. The C-PSS-10 consists of 10 the original PSS items in which the participants are asked to respond to each question on a five-point Likert scale (0 = never to 4 = very often), indicating how often they have felt or thought a certain way over the past 4 weeks. Six items (1,2,3,6,9,10) are negative and the remaining four (4,5,7,8) are positive, the latter are reverse scoring items. Composite scores can range from 0 to 40, with higher scores representing greater perceived stress.

Short form − 8 health survey (SF-8)
The SF-8 Health Survey (SF-8), a concise and generic assessment tool, especially in large-scale observational studies, generates a health profile consisting of eight sub-scales: physical functioning (PF), role limitations due to physical health problems (RP), bodily pain (BP), general health perceptions (GH), vitality (VT), social functioning (SF), role limitations due to emotional problems (RE), and mental health (MH), which are used for computing two summary measure scores (physical component score PCS and mental component score MCS) [52]. The SF-8 is comprised of an eight-item subset of the Short Form-36 Health Survey (SF-36) and has been translated into Chinese following the standard International Quality of Life Assessment (IQOLA) protocol in a prior study [53], whose Chinese version is repeatedly confirmed feasible, reliable, and valid using a large, representative sample from China and is readily available [54,55]. The health dimensions used in our study are evaluated for physical health and mental health, which is scored with the Medical Outcomes Study scoring system [52]. Total scores are calculated as the weighted sum of the scores for all items, fluctuated in the range 0-100, with higher scores denoting better health.

Goldberg anxiety and depression scale (GADS)
The Goldberg Anxiety and Depression Scale (GADS), individually referred to as Goldberg Anxiety Scale (GAS) and Goldberg Depression Scale (GDS), is an 18-item self-report symptom inventory [56]. The global score, which ranges from 0 to 18, is based on responses ("yes" or "no", with one or zero point respectively), asking how respondents to report symptoms experienced over the past month. Each subscale can give a maximum total of 9, with higher scores suggesting greater levels of symptomatology. Generally, anxiety score ≥ 5 or depression ≥2 shall be deemed as a 50% risk of a clinically important disturbance [56]. The GADS was selected as a comparator scale in earlier studies, which have revealed good psychometric properties and proven that it could be reliably and acceptably used by health sectors not specialized in mental health [57][58][59]. Based on the translation approaches mentioned above, the Chinese version of the GADS has been published elsewhere and displayed highly correlation with the C-PSQ in nursing students [44]. The various languages of the GADS presented a simple, quick and accurate method of detecting depression and anxiety in the general population.

Setting and participants
The total sample size consisted of 2798 from three cities of China, i.e. Wuhan, Ningbo, and Shiyan, respectively corresponding to three samples named A, B, and C ( Table 1). The participants were recruited from universities (colleges) and hospitals, which is closely related to medical field. Sample A and B belonged to medical students, while sample C was medical workers. A convenience sample of 130 undergraduate or postgraduate students at one university of public health, nursing, clinical or other medical related in Wuhan City participated in the survey. A total of 122 students in this sample completed the second test at last. Sampling method of sample B was stratified random sampling strategy and stratified college students by their grades. Briefly, we aimed to randomly sample 50% of the students from each grade of nursing students to obtain large, representative samples. Flowchart of the sampling strategy of sample B is shown in elsewhere [60]. Overall, a total of 1519 students from one college in Ningbo City were randomly selected. Sample C adopted stratified sampling to ensure maximal consideration of sampling representation by means of controlling their proportion of departments and occupational classes. Three hospitals in Shiyan City were randomly selected and this sample finally amounted to 1223 valid questionnaires for analysis. All participants were given a small incentive: a bar of chocolate or a pen worthy of 5 RMB (around 0.8 US dollars) for each responder as compensation for their time. Response rate: 93.85% for sample A, 95.66% for sample B and 90.59% for sample C. An average duration of the assessment for each respondent is about 15 min, and the top of the first page is printed with instructions for the questionnaire fulfillment. Sample test using the instruments was organized as follows. Using the C-PSQ and C-PSS-10, sample A was tested two times at two days interval (test-retest method). Sample B was measured by the C-PSQ and the Chinese GADS. Sample C was investigated through the C-PSQ, the Chinese SF-8, and the Chinese GADS (Table 1).

Statistical analysis Item response theory (IRT)
Given that the responses are ordinal, we used item response model for analysis of the Chinese version of the PSQ. IRT application requires two important assumptions [61]: (1) the construct being measured is in fact unidimensional and (2) the items display local independence. As Georg Rasch noted, Rasch measurement generally converts dichotomous and rating scale observations into linear measures. In contrast to classical test theory, Rasch analysis accounts for both the difficulty of tasks (item difficulty) and the abilities of subjects (person ability) by modeling the relationship between a latent trait (i.e. a respondent's functional ability) and the items used to measure that trait.
To validate the Chinese PSQ, these key indicators could best be summed up as: (1) Information-weighted fit (Infit) and outlier-sensitive fit (Outfit) mean square (MNSQ) statistics. Reasonable item mean square ranges for Infit and Outfit between 0.6 and 1.4 were considered as an indicator of acceptable fit, since type of test was rating scale (survey) [62]. (2) Unidimensionality. In addition to item-fit statistics, unidimensionality of the measured trait was assessed further using principal component analysis (PCA) of the residuals. There were two criteria: the variance explained by the first component should be adequate (> 50%); the unexplained variance in the first contrast of the residuals should be less than 3.0 eigenvalue units, preferably < 2.0 eigenvalue units [63].
(3) Local dependence (LD). Local item independence requires that an item be independent of other items -can be tested by the residual correlation between the items, with a cutoff value less than 0.30 [63]. Furthermore, following the latest recommendations, evaluation of local response dependence should also take into consideration the residual correlation relative to the average residual correlation [64]. (4) Person separation index (PSI) and person reliability (PR). Person separation is used to classify people. The ability of the scale to distinguish different strata (or groups) among participants was assessed using PSI and PR. They are indicators of the fit statistics' reliability. An acceptable level of person separation of 2.0 and reliability of 0.8 corresponded to the ability to differentiate among 3 strata; while person separation of 3.0 and reliability of 0.9 respectively represents an excellent level or reliability. (5) Differential item functioning (DIF). Differential item functioning refers to the situation where members from different groups (e.g. different populations, gender, socioeconomic level) on the same level of the latent trait (disease severity, quality of life) have a different probability of giving a certain response to a particular item [65]. DIF contrast was considered absent if it was less than 0.50 logits (between − 0.50 and 0.50 logit values) [63], minimal but probably inconsequential if it ranged between 0.50 and 1.0 logits, and notable if it was > 1.0 logits. (6) Category thresholds. Category threshold order, which is reflected by the category probability curves, is an important parameter for demonstrating the usage of response categories, and it is essential for the calculation of person and item calibrations. Disorder thresholds occur when respondents have difficulty discriminating between ordered response options. (7) Person-item map. The map presents person measures ranked by their ability level and item difficulties ranked by difficulty. It can provide a way to visualize how well the items target the ability of the respondents. Optimally, the difference between respondents and item measure should be approximately 0 logits. Generally, a mean difference between the person and item measure in magnitude of 1.0 logits indicates significant mistargeting. (8) Discrimination index. The index of indiscrimination was defined as the ability of an item on the basis of which the discrimination is made between superiors and inferiors. Ebel and Frisbie gave following rule of thumb (i.e. 0.40 and up, very good items; 0.30 to 0.39, reasonably good but possibly subject to improvement; 0.20 to 0.29, marginal items, usually needing and being subject to improvement; below 0.19, poor items, to be rejected or improved by revision) [66] for determining the quality of items with respect to their discrimination index.

Classical test theory (CTT)
To evaluate the psychometric properties was an integral part of introducing a useful health measurement tool [67]. Validity was concerned with the true value and accuracy that a measure attempts to capture, and Reliability was defined as the consistency and precision of a measurement [68]. For validity evaluation work, we in turn assessed the construct validity, concurrent validity and convergent validity. The construct validity, factorial validation and the scale structure were verified through exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) in aspects of exploration, validation and cross-validation. It would be better to split the sample and use one part of the data to derive a model and the other part to confirm the derived model. For exploratory analysis, Maximum Likelihood (ML) with an oblique rotation (promax, power coefficient = 4) were conducted, this choice of method for extraction and rotation was motivated by these prior studies [23,37], and the number of components to retain was determined by eigenvalues (> 1), scree plots, items content and interpretability as well as total variance explained (usually 60% or higher) [69]. For confirmatory analysis, give that responses to items in the PSQ are obviously ordinal, we used a Weighted Least Square Mean and Variance Adjusted (WLSMV) to accommodate categorical data [70,71]. Concurrent validity can be described as "scores on the measurement tool are correlated to a related criterion at the same time"; convergent validity can be defined as "extent to which different measures of the same construct correlate with one other" [72]. Concurrent validity and convergent validity were examined by testing Spearman's correlations of the C-PSQ with the scales mentioned above. The correlative coefficient greater than or equal to 0.45 is recommended by many researchers [72]. We did not assess predictive validity and content validity. Content validity was reported elsewhere [44]. In CFA and/or multi-group CFA, some goodness-of-fit indices usually were recommend using benchmarks for judging model fit, such as Normed Chi-square (NC) < 2.0-< 3.0 [73], Non-Normed Fit Index/Tucker-Lewis Index (TLI) > 0.90 [69], Comparative Fit Index (CFI) > 0.90 [69], Root Mean Square Error of Approximation (RMSEA) < 0.05 (or 0.06 denotes "a good fit") or 0.08 (denotes "a reasonable fit") [74,75], Weighted Root Mean Square Residual (WRMR) < 1.0 [76]. To compare the goodness-of-fit between the nested measurement invariance (MI) models, we followed the aforementioned recommendation of using differences in RMSEA, CFI, and TLI. Hereby, models with a change in CFI (ΔCFI) ≤ 0.010, change in RMSEA (ΔRMSEA) ≤ 0.015, and change in TLI (ΔTLI) ≤ 0.010 were favored [77][78][79]. Note that we did not compare with a chi-square difference test in four steps models, including configural equivalence, metric invariance, scalar invariance and strict invariance. Because the consensus was that this may be an overly stringent criterion since Δχ 2 (χ 2 ) test is dependent on sample size with a rejection of models with trivial practical misfit in large samples (N > 300) [78,80,81]. WRMR illustrated worse fit when sample size increased or model misspecification increased [76].
For reliability assessment, we first evaluated internal consistency using Cronbach's alpha, Guttman's lambda-2, McDonald's omegas, item-total correlations, and splithalf reliability coefficient. Cronbach's alpha, Guttman's lambda-2 (a better reliability estimation method [82]) and McDonald's omegas (an optimum estimation on homogeneity reliability) are both internal reliability coefficients [83]. Item-total correlations offer information about how well each item is associated with total score for further assessment of internal consistency. Split-half reliability correlates scores between randomly divide all items that purport to measure the same construct into two sets, calculated based upon Spearman-Brown prediction formula in this study. Second, we evaluated the reproducibility, including test-retest reliability or score consistency over time using Pearson's correlation and intraclass correlation coefficient (ICC) at an interval of two days. ICC estimates and their 95% confidence intervals were calculated selecting single measures and twoway mixed-effects model with absolute agreement type in view of method and range in collecting retest data [84,85], to assess level of agreement between scores at two time points. We also computed the standard error of measurement, which helps quantify the variability of measurement errors and estimate measurement precision, as a supplement indicator in test-retest reliability assessment [86]. Cronbach's alpha, a positive rating for internal consistency, reasonably ranges from 0.70 to 0.95 [87]. Considering the proof that alpha ≤ lambda-2 is a standard result in CTT [88,89]; hence Guttman's lambda-2 should move above 0.70. An omega value is above 0.70 indicates that there are a reliable total score [90]. Split-half reliability coefficient estimates above 0.70 are generally considered acceptable [91], obviously, it will be very close to 1.0. Item-total correlations should move in a range between 0.30 and 0.70 [92]. With respect to test-retest correlation and ICC, 0.70 or 0.75 would act as a set of recommended threshold values [72,84,87,93].
The use of traditional methods, including CTT, was conducted using SPSS/PASW Statistics

Ethics statement
Prior to launching this study, ethical approval was provided by the Ethics Committee of Wuhan University School of Medicine (WUSM), China. All procedures were in accordance with the relevant requirements of the Declaration of Helsinki and its revised version [94]. Informed consents were obtained from the relevant administrative department at the study site and from the medical students and workers enrolled. The data collection and transfer process were conducted anonymously to ensure full respect and protection of individual privacy rights. In addition, written permission to create and use this Chinese version of the PSQ was obtained from Susan Levenstein M.D. by e-mail.

Participants of the study
The participants' socio-demographic characteristics were shown in Table 1

Rasch analysis (item selection)
Item fit statistics: Item fit statistics showed that almost all items fitted the Rasch model. No items were either under fitting (MNSQ > 1.4) or over fitting (MNSQ < 0.60) (Additional file 1, including Table 1a, b, c, d).
Local dependence (LD): Three item pairs presented local dependency, i.e., displaying positive correlations of their residuals > 0.30. Compared to the average item residual correlation of − 0.033 in the thirty-item data set, the correlations between items one and thirteen of 0.312, items thirteen and twenty-one of 0.349, items twenty-six and twenty-seven were relatively large and these three item pairs were the positive correlation. Differential item functioning (DIF): In general, the items did not show DIF apart from (Additional file 2, including  Table 2a, b, c, d): items 2, 7, 10, 22, 26, 27 (first round); items 15, 24, 28 (second round); items 16 (third round). Unidimensionality: The variance explained by RA ranged from 57.5 to 50.3% and unexplained variance in 1st contrast ranged from 3.21 to 1.70 (Table 2). In the first round, the instruction to delete these items is: 1, 7, 10, 13, 17, 21, 25, 29. Discrimination index: Item 11 did show low discrimination index (0.37, below 0.40) in Table 1a. Finally, a total of seventeen items (i.e., item 1, 2, 7, 10, 11,13,15,16,17,21,22,24,25,26,27,28,29) should be removed. Then, the C-PSQ-13 was formed gradually by these above criterias. Separation and Reliability: Acceptable PSI (> 2.00) and good PR (> 0.80) values were respectively presented in Table 2, suggesting adequate separation ability for this instrument. Response forms: No evidence of disordered thresholds was found in the category probability curves for the C-PSQ-30 and C-PSQ-13, as the category calibration increased in an orderly way (demonstrated in Figs. 1 and 2), and suggesting this rating scale functioned well for both forms. Four response categories were found for all items, indicating three thresholds for each item. Person-item map: The person-item map given in Figs. 3 and 4 illustrated the relationship between item difficulty and person ability. In the C-PSQ-30 and C-PSQ-13, item difficulty had the same mean value = 0 logits, while person ability correspondingly had a mean value = − 0.43 logits and − 0.60 logits. Thus, the difference between the item and the person means were 0.43 logits and 0.60 logits respectively; both are less than 1.0 logit indicates targeting.

Factor analysis (construct validity)
Given sample size in factor analysis, at least 200 cases is probably an appropriate threshold, whereas samples of 500 or more observations are strongly recommended [95,96]. Sampling adequacy for factor analysis was tested separately for medical students (sample A and B) and medical workers (sample C). In the C-PSQ-30,  Kaiser-Meyer-Olkin (KMO) values were 0.951 (medical students) and 0.964 (medical workers); similarly, in the C-PSQ-13, KMO values were 0.923 (medical students) and 0.930 (medical workers), revealing marvelous level of sampling adequacy which were well above the recommended threshold of 0.6 [97,98]. All of Bartlett's test of sphericity were significant (P < 0.001), also denoting that the items could be considered apt for factor analyses. Inspection of eigenvalues, scree plot and item content and interpretability suggested respectively four-factor solution (30 items, medical students), five-factor solution (30 items, medical workers), two-factor solution (13 items, medical students) and two-factor solution (13 items, medical workers). The cross-validation was tested by the other sample set for the model. In particular, the EFA of medical workers indicated that there is only one item (i.e. item 23) on a factor. The CFA model of medical students could not fit, thereby switching to principal component analysis in EFA. Table 3 compared the four models that showed the fit statistics. Through cross-validation, our CFAs found that the two-factor solution using medical students' data to derive a model and using medical workers' data to validate the derived model is the best fitting model. This optimal model has two factors, namely factor I (item 4,8,9,14,18,19,23,30) and factor II (item 3,5,6,12,20). The two factors is inconsistency with that of the recently published literature [44], we renamed these factors "constraint" and "imbalance" respectively. The results of model fit are the same between in second-order model and first-order model owing to its two factors condition. Thereafter, we ran a series of CFA to test various factor structures reported in the literature, including English/ Italian (source language) [22], Spanish [40], German [23], Greek [38] and Swedish [37]. There existed relatively clear and distinct factor solution in these various versions and these were compared with our twofactor solution, Chinese. Table 3 presented the fit indices for all models tested.
Next, results regarding measurement invariance of the C-PSQ-13 across subgroups (medical students and medical workers) are presented in Table 4. The results of four steps ranging from least to most rigorous suggested invariance across subgroups: ΔTLI = 0.002, 0.009, and 0.000 < 0.01; ΔCFI = 0.004, 0.014, and 0.006; ΔRMSEA = 0.001, 0.004, and 0.000 < 0.015. In consideration of subgroups, the C-PSQ-13 can be considered fully invariant (except for 0.014, as described above). In view of sample size across gender and age is too unbalance, therefore we no performed Multi-group CFA in these between groups.

Concurrent and convergent validity
The Chinese PSS-10, SF-8 and GADS would serve as a criterion separately. The correlation matrix of these instruments was depicted as follows ( Table 5). The correlation coefficient between subscales (scale) of the C-PSQ-13 ranged from 0.640 to 0.947, indicating moderate (0.5-0.8) to high correlation. Most of correlation coefficients were above quality criteria (0.45) except for between GAS and imbalance (r = 0.438) in these instruments and its subscales. Especially concerning was that all subscales and the C-PSQ-13 most highly correlated with the Chinese PSS-10 in these criterions whereas the Chinese GADS reflected the lowest correlation with the C-PSQ-13 and its subscales. Coefficients of correlation negative feelings with the C-PSQ-13 and its subscales were higher than positive feelings with the C-PSQ-13 and its subscales, MCS and PCS with similar results. Additionally, the Chinese SF-8 and the C-PSQ-13 and its subscales were negatively correlated. The results demonstrated that scores of other instruments highly correlated with PSQ Index. On the whole, concurrent and convergent validity of the C-PSQ-13 and its subscales was more satisfactory. Table 6 summarized the instrument distribution and the reliability test results based on quality criteria. Of these, adequate item-total score correlations ranges between 0.30 and 0.70, as described in CTT. All corrected itemtotal correlations were in range (except for item 11, r = 0.319), reflecting satisfactory scale homogeneity. Note that if item 11 dropped, Cronbach's alpha and McDonald's omegas on the PSQ would be increased. Cronbach's alpha of the Chinese both PSQ-13 and PSQ-30 were 0.878 and 0.935 respectively. Both McDonald's omegas and Guttman's lambda-2 were the same result, 0.880 and 0.937 respectively. Split-half reliability coefficients were 0.852 and 0.919 individually. Additionally, internal consistency reliability of subscales, using

Discussion
The PSQ was developed in 1993 to examine people subjective stress perception on different clinical or nonclinical areas, including both physical and psychological on quality of life. The results of a Rasch analysis and a factor analysis were complementary, which helped provide a comprehensive perspective on the construct validity of the Chinese PSQ. The previous study validated the measurement properties of the Chinese PSQ by CTT only [44]. Admittedly short instruments (scales or questionnaires) improve assessment as they save response time and effort, increase response rate, minimize burden, and decrease fatigue effect. The development and validation was performed using Rasch analysis, a relatively modern psychometric technique for developing and refining rating instruments (i.e. scales and questionnaires) with sound psychometric properties. Indeed, since both multidimensionality and response dependency are serious threats of the metric characteristics of an assessment and implies that responses to an item depend on responses to other items or that the scale reflects more than one latent trait, requiring support for unidimensionality and local independence [99,100]. Thus, IRT methodology application is contingent on the extent to which these assumption are met [61]. The results (first round) of the Rasch model analysis revealed that the C-PSQ-30 is not unidimensional, since the unexplained variance in the first contrast (3.21) was greater than 2.0 in the PCA. Summary of previous study on the validation of the PSQ showed that this instrument may be subjectively conceived as a seven-factor model [22], sixfactor model [40], five-factor model [37,38], or fourfactor model [23]. Our current study indicated that three pair items showed local dependency, six items (first round) presented DIF and one item demonstrated low discrimination index. According to the assumptions and guidelines [61,63], we finally performed four round   validation until that are met. Of these, we removed 10 items by three rounds of DIF. It would not be more reasonable to build an instrument that is not biased (with items that do not present a Differential Item Functioning). Crucially, 13 items were retained in the Chinese PSQ adaptation (Table 2). Rasch reliability indexes (PSI and PR) confirmed their high values, which give us a good degree of confidence in the consistency of both person-ability and item-difficulty estimates. Our study demonstrated an ordered threshold in the category probability curves, which means that the response forms were adequate, the item difficulty matched medical students' or medical workers' (these respondents') ability levels. The items were well-targeted to the subjects, with a mean difference of 0.43 and 0.60 logits in C-PSQ-30 and C-PSQ-13, respectively. This means that the difficulty of the items on these questionnaires were appropriate for the ability of respondents.
The focus of the present study was to investigate a more appropriate factorial structure of the C-PSQ, especially to improve and promote this Chinese PSQ adaptation. The analyses encompassed the EFA to extract factors, the CFA to test model, and cross-validation of the model seen as suitable in separate large-scale samples. Regarding exploratory factor analysis among medical students, two factors were extracted from the C-PSQ-13. This model is the best fitting model. Pertaining to WRMR, the smaller value, the better fit (acceptable < 1, and good < 0.8 [101]), as Linda K. Muthén noted in 2005, in some cases other fitting indices were good, and the WRMR value is large, so we did not focus on WRMR at that time. Notwithstanding PSQ Index was originally proposed by the instrument developers and counted to a perceived stress index across the PSQ items [22], it is notable that the model established in this study is to continue supporting a perceived stress factor, that reflects all first-order factors [37,38], and confirms that on utilization of PSQ Index do have a certain rationality and feasibility. According to the results of current study and previous studies, this Chinese version (C-PSQ-13), the Swedish version [37] and the German version [23] belonged to the reduced version, whereas the Greek version [38], the Spanish version [40], the Thai version [41], the Norwegian version [25] and the Arabic version [43] retained all 30 items, while its various versions still remained adaptation on levels of items and factors. Upon closer inspection, the structure of the questions in each subscale differed from those of the original instrument. Indeed, these across studies that evaluated the factor structures have reported nonunidimensional for the PSQ. Based on the Recent PSQ rather than the General PSQ form of the questionnaire could possibly have affected the outcome in our study. These conditions, cultural adaptation and translation quality as well as sample properties, would be unable to ignore for influence on factor solution. Cross-cultural differences, perhaps not surprisingly, led to discuss some discrepancy on factor structures of the PSQ.
Criterion validation consists of correlating the new instrument with well accepted measure of the same characteristics, usually known as the criterion validity. Using the Chinese PSS-10, SF-8 and GADS respectively as the criterion, concurrent validation values of the C-PSQ-13 are above a reasonable threshold value (0.45) [72]. More specifically, the correlation with the Chinese PSS is close to 0.80 (high correlation ≥0.80), which revealed some aspect of the new tool with a widely accepted measure of the same characteristics [67]. Predictive validity was failed to assess on account of no follow-up.
A satisfactory level of reliability depends on how a measure is being used. Three internal consistency reliability methods of this reduced version are less than that of the C-PSQ-30, but still display good reliability. Cronbach's alpha values were higher than 0.70 for the C-PSQ-13 and the C-PSQ-30 in the present study, like across studies and then their alpha values held wave nearby 0.90 [22,23,25,36,[38][39][40][41]. The higher alpha values in those studies may be owing to characteristics of the samples. The more items would too have higher Cronbach's alpha values. Guttman's lambda-2 values, only reported in this study, still were greater than quality control standard for the C-PSQ-13 and the C-PSQ-30.
McDonald's omegas values are approximately equal to Guttman's lambda-2 values for the C-PSQ-13 and the C-PSQ-30, respectively. Alpha is and remains to be the best choice among all published reliability coefficients, even though alpha should be replaced by better and readily available methods [82,102]. Hence, we decided to report both alpha and lambda-2, as an indication of internal consistency. Their samples of different studies, at any rate, appeared to have experienced relatively intense stress and thus may have responded to items more consistently. Although internal consistency can be higher in the present study, on most occasions, additional evaluations such as item-total correlations or split-half reliability coefficients were suggested to confirm the internal consistency of the C-PSQ-13.
With regard to reproducibility, the aim was to assess reliability and agreement, through repeated measurements in stable respondents (test-retest) provide similar answers. Notably, test-retest Spearman correlations of the adaptation and the Chinese PSQ are apparently greater than quality criteria. Relatively, these values (0.782 and 0.874) are more than the results at one-week intervals in the former research [44]. Test-retest Spearman correlation of the C-PSQ-13 is less than the original study at 8 days, the Spanish study at 13 days, the Greek study at one month, whereas the result of the Chinese PSQ is more than that of three studies [22,38,40]. These results proved that the instrument has an appropriate level of both stability and responsiveness to change over time. Although test-retest reliability are commonly measured with Spearman correlation, it is better to use the intraclass correlation based on a two-way repeated measures analysis of variance looking at absolute agreement, since this is sensitive to any bias between or among times [67]. ICCs of this adaption and the C-PSQ were above 0.75 and close to 0.90 respectively, indicating good and excellent reliability [84]. The score reproducibility over time of the adaption (r s = 0.782, ICC = 0.805) was less than that of the C-PSQ (r s = 0.874, ICC = 0.899) in 122 participants in this study. In brief, the relatively high internal consistency (alpha, lambda-2, omegas) and reproducibility (test-retest correlations, ICCs) values disclosed strong reliability.
In summary, the results of the present study validated the metric characteristics of the revised PSQ, the Simplification of the PSQ-13, which was adapted from the original PSQ-30. Through examination of a series of results, the C-PSQ-13 obtains good and stable psychometric properties for most indicators and still remains confirmed in current study. To date, no known studies have examined measurement properties of the PSQ using IRT, in combination with CTT. However, this study has several limitations. First, all voluntary samples originated from the medic field, possibly resulting in insufficient sample representativeness and the lack of external validity in our work. In other words, it could limit its generalizability. Second, the study focuses on the fit of Rasch model, item section, construct validity, internal consistency and test-retest reliability. Other forms validity (predictive, content) is needed to more fully support metric characteristics of the instrument. While the values of ICCs in the testing of test-retest reliability were greater than 0.75, securing reliability, the sample size of 122 (only sample A) respondents apparently was a little small. Third, we cannot exclude that some characteristics (such as cross-cultural and language differences, translation quality, sampling attributes, testing situations, forms of instrument [the Recent or General PSQ] and other subjective and objective factors [37,103]) influenced our results. Lastly, most of the data were cross-sectional, thereby limiting the capability of drawing causal inferences. As such, further research should replicate these findings with other populations by adequate follow-up data and/or multi-center studies concerning stress perception.

Conclusion
Taken together, the C-PSQ-13 attained to a valid, reliable, cost and time-effective measuring tool that enables us to evaluate perceived stress both in respect to research studies and clinical settings. It measures two dimensions including constraint and imbalance. The best model is to continue supporting a perceived stress factor and to validate measurement invariance across subgroups, confirming that on utilization of PSQ Index do have a certain rationality and feasibility.
Results contribute to the emerging empirical comparison across studies and/or subgroups concerning the factorial structure of the PSQ. Various studies can be compared with the reference values at hand, such as PSQ Index and different solutions on factor structure from the original. Admittedly, the various language versions of the PSQ, including the original PSQ's structure, were not replicable. Nevertheless, our revision of the PSQ's structure proved relative stability in Chinese language and culture. In consideration of this advantage and respondent burden, the C-PSQ-13 is preferable, as a potentially valuable instrument.