Measuring stress in Australia: validation of the perceived stress scale (PSS-14) in a national sample

Ribeiro Santiago, Pedro H.; Nielsen, Tine; Smithers, Lisa Gaye; Roberts, Rachel; Jamieson, Lisa

doi:10.1186/s12955-020-01343-x

Research
Open access
Published: 15 April 2020

Health and Quality of Life Outcomes volume 18, Article number: 100 (2020)

Abstract

Background

In Australia, stress levels have increased over the years, affecting the physical and mental health of the general population. The aim of the present study was to evaluate the validity and reliability of the PSS-14 in an Australian population.

Methods

The PSS-14 was applied to a large national sample comprising 3857 Australians in the population-based cross-sectional study Australia’s National Survey of Adult Oral Health 2004–2006. The psychometric properties analyzed with the Rasch model and Graphical Log-linear Rasch models were: model fit, item fit, local dependence, differential item functioning, unidimensionality, reliability, targeting and criterion validity.

Results

The PSS-14 did not fit the pure RM (χ2 (55) = 3828.3, p < 0.001) and the unidimensionality of the whole scale was rejected (p < 0.001). The Perceived Stress (χ2 (27) = 1409.7, p < 0.001) and Perceived Control (χ2 (27) = 713.4, p < 0.001) subscales did not fit the pure RM. After the deletion of two items, the Perceived Stress subscale (χ2 (96) = 94.4, p = 0.440) fitted a GLLRM, while the Perceived Control subscale (χ2 (55) = 62.50, p = 0.224) fitted a GLLRM after the exclusion of four misfitting items.

Conclusions

The Perceived Stress subscale displayed adequate psychometric properties after the deletion of two items; however, the majority of problems centered around the Perceived Control subscale. The presence of differential item functioning among four items indicates that adjustment of total scores is required to avoid measurement bias. Recommendations for future applications in Australia are provided.

Introduction

In Australia, the Australian Psychological Society (APS) conducted a ‘State-of-the-Nation’ Stress & Well-Being Survey (SWBS) from 2011 to 2015 to investigate stress at a national level [1, 2]. The results showed that almost two in three Australians (64%) reported that stress was impacting their mental health, while approximately one in five (17%) reported that stress was strongly impacting their physical health [3]. The findings from the last survey, which had 1731 respondents, indicated that compared to 2011 the levels of stress increased, and the levels of well-being decreased in the Australian population. One of the concerning findings was that, among those with severe levels of distress, 61% drank alcohol, 41% gambled, 40% smoked and 31% used recreational drugs as a coping mechanism [2]. The surveys also revealed gender differences. Women were consistently more affected by stress than men and reported financial and health issues as their main sources of concern [3].

One of the many psychological instruments used in the SWBS was the Perceived Stress Scale (PSS) [4]. The PSS is the world’s most widely used instrument to measure perceived stress [5] and since its development has been continuously applied in empirical research [6, 7]. The PSS was developed based on the theoretical perspective of Lazarus [8], which rather than focusing on external environmental stressors, postulated that the stress response is determined by the perception of these environmental stressors. According to Lazarus [8], life events, such as divorce or losing a job, only cause stress when they are appraised as threatening (e.g. “I don’t have another job”) and there is a perception of insufficient coping resources (e.g. “I don’t know anyone who could employ me”). The measurement of stress has then been operationalized in two ways: the environmental perspective (e.g. using life-event scales) and the psychological perspective (e.g. using perceived stress scales) [9, 10]. The PSS was developed to measure stress from the psychological perspective, diverging from the life-event scales regularly used at that time [11]. The initial validations conducted by Cohen [4, 12] led to the creation of two shortened scales derived from the original 14 item-version (PSS-14): the PSS-10 and the PSS-4.

The results of the SWBS surveys were nationally reported by the Australian media (see “Australian women feel more stressed than men, mental health survey finds” [13]). However, the reports did not specify which PSS version was used and indicated only that the “level of stress was derived by summing the scores of the 11 scale items” [2]. Additionally, evidence of validity was not provided. Considering the high levels of stress reported in the Australian population, it is necessary to ensure that psychological measures applied to measure stress in Australians are valid and reliable, so it is possible to have confidence in the interpretation of test results. In the present study, we aim to investigate the psychometric properties of the PSS-14 in the general Australian population and examine whether this instrument can provide a valid measure of perceived stress for future research. To evaluate the PSS-14 validity and reliability we used data collected for Australia’s National Survey of Adult Oral Health (NSAOH) 2004–2006, a broad project originally aimed to determine the psychosocial determinants of oral health in Australia. Despite being conducted prior to the SWBS, the NSAOH 2004–2006 has a large national sample (n = 3857) that can provide evidence of the PSS-14 validity in the Australian general population.

The present research

The psychometric properties of the PSS have been evaluated in multiple countries [14]. There are, however, two main limitations regarding the generalizability of its psychometric properties to an Australian population. Firstly, the majority of studies evaluated the PSS-14 in small and/or non-representative samples [14]. For example, in China, the PSS-14 was evaluated in a sample of 1860 cardiac patients who smoked [15], while the PSS-10 was evaluated in a sample of policewomen [16]. Secondly, other studies were conducted in countries culturally and economically diverse from Australia, such as the application of the PSS-10 to 479 adults in Thailand [17], a country known for its “collectivist Eastern culture” [18]; or the application of the PSS-14 to 941 adults in Greece [19], which recently experienced financial crisis [20]. Among all countries studied, Canada is the western developed nation most similar to Australia due to its “large geography, low population density and similar health care challenges” [21]. However, the PSS-14 was initially applied in Canada to 96 psychiatric patients [22] and the PSS-4 was later evaluated in 217 pregnant women [23]. The peculiarity of the samples from Canada (i.e. psychiatric patients) and most countries in general makes it difficult to generalize the results to typical members of the Australian general population. For the most part, the PSS has been validated in samples experiencing stressful environments (i.e. patients, students, policemen) rather than in general populations [14].

The most relevant study in a population similar to Australia continues to be the validation conducted by Cohen and Williamson [12] in a representative sample of 2387 Americans. Both countries, Australia and the United States (US), are large high income countries [24], with a history of English colonization [25] and populations with similar demographic characteristics [26] and morbidity patterns [27, 28]. Nevertheless, there are important dissimilarities in terms of social-political context between these countries. For example, in the US, the national health system is a private employer-based and individual insurance program that provides coverage to 90% of the population, while Australia has a universal public insurance program covering 100% of the individuals [26]. Although finances are the main source of stress both in Australia [2] and the US [29], there are structural differences regarding how these environmental stressors are experienced by each population (i.e. concerns with health costs are more prominent in the US).

One important characteristic of the Australian population is the cultural background of its Indigenous groups, namely Aboriginal Australians and Torres Strait Islanders (ABTSI). Aboriginal Australians’ experiences of well-being are rather distinct from those of western individuals [30] and “Western psychological concepts are inappropriate and potentially damaging to Indigenous people” [31]. One example is the PSS-14, which was recently validated for an Aboriginal population and the findings showed a weak latent correlation between the “Perceived Stress” and “Perceived Coping” subscales (r = 0.14), a result distinct from the moderate (r = 0.50) to strong (r = 0.70) correlations found in western societies [32]. For these reasons, we followed the recent recommendations by Kowal, Gunthorpe [31] and Santiago, Roberts [32] that ABTSI are a culturally distinct group in which psychological instruments should be evaluated separately from the general Australian population.

Hence, the present study aims to (1) investigate the psychometric properties of the PSS-14 in the general Australian population. We hypothesize that the functioning of the PSS-14 in the Australian population is similar but not equal to its functioning in other high-income countries. In addition, we aim to (2) update the evidence about the PSS-14 functioning in developed countries using a large national sample and (3) further advance the knowledge regarding the PSS psychometric properties using item-response theory to investigate issues of differential item functioning (DIF) and local dependence (LD). The previous research about stress in Australia showed that “Australian women feel more stressed than men” [13]. Although this result is common in many western countries, a long-standing question is whether those differences are due to measurement bias [14, 33]. Therefore, we aim to (4) investigate gender differences in PSS scores, and whether these differences are due to measurement bias.

Finally, we aim to evaluate criterion validity by inspecting convergent and divergent validity with two psychological constructs (social support and stress at work) of the nomological network of perceived stress [34]. Social support has been shown by a large body of research to be a protective (or buffering) factor against stress [35]. Social support refers to the functions performed by family, friends, and significant others when an individual encounters an external environmental stressor [36]. In this case, family, friends or significant others can help to change the situation (e.g. helping with a task at work) or change the meaning of the situation (e.g. helping to interpret the event from a less distressing or extreme perspective) [37]. In both cases, the individual has additional resources to deal with the environmental stressor and this decreases their perception of how stressful the situation is [38].

On the other hand, psychological stress can be experienced at work due to a demanding environment. One theoretical model that explains how the work environment generates stressful experiences is the effort-reward imbalance [39]. The model indicates that when the rewards received at work do not correspond to the efforts employed (‘high cost/low gain’), the imbalance can lead to adverse stress responses [40]. Therefore, it is expected that participants with high perceived stress will have low social support from friends, family and significant others and experience more effort with fewer rewards at work.

To achieve these aims, we analysed data from Australia’s National Survey of Adult Oral Health (NSAOH) 2004–2006, a broad project originally designed to determine the psychosocial determinants of oral health in the Australian population. The NSAOH was chosen since it provides the best available data for the evaluation of the PSS-14 validity in the Australian population. Firstly, the NSAOH sample comprises the largest national Australian sample (n = 3857) in which the PSS-14 has been applied. Secondly, the NSAOH achieved high standards of response quality for surveys [41], including high response rates (77.4%) [42] and low missingness of individual items (0.0 to 1.3%). Survey response rates have declined over the decades, with average rates below 50% being consistently reported since the 1990s [43]. In summary, the large sample recruited at a national level and the high-quality PSS-14 item responses qualified the NSAOH as the preferred data for our research question.

Methods

Participants and procedures

The sample comprised 3857 non-Aboriginal Australians in the population-based cross-sectional study Australia’s National Survey of Adult Oral Health 2004–2006. The NSAOH 2004–2006 was a broad project aimed to determine the psychosocial determinants of oral health in Australia. The survey had a three-stage (i.e. postcodes, households, people) stratified clustered sampling design to select a representative sample of Australian adults. Participants were contacted by study staff who conducted a computer-assisted telephone interview. Interviewees that agreed to undertake dental examinations were mailed the PSS-14 (Supplementary Table 1 – Additional file 1), along with the other complementary measures, as part of a larger questionnaire. The NSAOH 2004–2006 was approved by the University of Adelaide’s Human Research Ethics Committee. All participants provided signed informed consent [44]. A sample of 42 Aboriginal Australians was removed from the analysis since the PSS-14 has been previously validated for this group [32] and it is recommended that psychometric research with Indigenous groups should be conducted separately due to cultural differences [31].

Psychometric properties of the perceived stress scale

The psychometric properties of the PSS have been evaluated in multiple countries, including Spain, Canada, Brazil, Ethiopia and Japan, and its most studied property is dimensionality. There is a consensus, mostly from factor analytical studies, that the PSS has a two-dimensional structure, composed of negatively worded and positively worded items [14]. These two dimensions are consistent with Lazarus’s [8] theory and were named the “Perceived Stress” and “Perceived Control” subscales, although other terminologies such as “Perceived Distress” and “Perceived Coping” have also been used [22].

Considering the robust evidence regarding dimensionality, a few psychometric studies have started to evaluate DIF. One main hypothesis analysed is whether the PSS items are biased according to gender [5, 33, 45, 46, 47, 48]. Since women have consistently scored higher than men on the Perceived Stress subscale (but not on the Perceived Control subscale [22, 33, 47]), a long-standing debate in the PSS literature is whether score differences are “an artifact of measurement bias” or “true gender differences arising from social, biological, or psychological influences” [14]. The findings regarding DIF by gender are mixed [5, 33, 45, 46, 47, 48, 49]. Although some studies indicated no evidence of DIF [5, 33, 46], Cole [45] reported that PSS-10 items 3, 6, 7, 8 and 10 had DIF with a small magnitude and suggested that the “combination of the potentially slightly biased items may explain the apparent test level bias towards women”. Gitchel et al. [47] found DIF by gender for PSS-10 items 1, 3, 4 and 6, a result partially confirmed by Nielsen and Dammeyer [48] (who also reported DIF for Items 1 and 3). Other sources of DIF have also been investigated. Regarding education, DIF was found for the PSS-10 items 3, 4, 8 and 9 [45], while other studies analyzed age, ethnicity, and literacy [45, 49].

The analysis of LD of PSS items has only recently started [48, 50]. The investigation of LD is especially relevant for the PSS since, in many of the PSS-14 studies, the two-factor structure accounted for less than 50% of the total variance [14]. These findings suggest that a high percentage of the variance of item responses is not explained by the latent trait, and the PSS literature is still not clear regarding what these other influences could be.

Finally, the PSS-14 has previously displayed adequate reliability in different samples. The internal consistency reliability, measured by the Cronbach’s α [51], was higher than .70 in 11 of 12 studies, while the test-retest reliability was higher than .70 in 2 of 3 studies [14]. However, since Cronbach’s α provides a lower-bound estimate of reliability when items are locally independent [52], the analysis of LD of PSS items is required to ensure that reliability estimates are not inflated [50].

Complementary measures

The perceived stress scale (PSS)

The PSS is a five-point scale (1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree) with a two-factor structure of perceived Stress (PS) and perceived Coping (PC) which evaluates if a person’s life is perceived as unpredictable, uncontrollable, or overloading [4].

The two complementary measures used in this study in the analysis of criterion validity were:

The Multidimensional Scale of Perceived Social Support (MSPSS): The MSPSS is a 12 item five-point scale (1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree), with a three-factor structure of family (FA), friend (FR) and significant others (SO) [53]. The MSPSS containing all 12 items (α = 0.93) and the FA (α = 0.92), FR (α = 0.92) and SO (α = 0.95) subscales displayed excellent reliability.
The Effort-Reward Imbalance Questionnaire (ERI): A shorter version of the five-point scale (1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree) ERI questionnaire with 11 items was used. The ERI questionnaire has a three-factor structure composed of effort (EF), reward (RD) and over commitment (OC) [40]. The ERI containing all 11 items (α = 0.75) and the EF (α = 0.85) and RD (α = 0.73) subscales displayed adequate reliability. The OC (α = 0.52) subscale displayed poor reliability and for this reason was not included in the analysis of criterion validity.

The Rasch measurement models

The Rasch model (RM) is part of the family of Item Response Theory (IRT) models and it has two distinctive features over other IRT models: (1) the sum score is a sufficient statistic for the person parameter, containing all the information that allows statistical inference about the latent trait [54]; and (2) inference can be conducted on a conditional framework [55], since person and item parameters can be eliminated by means of conditional probabilities [56], a property that Rasch [57] referred to as specific objectivity.
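
The sufficiency property can be illustrated numerically. The following is a minimal sketch in Python, assuming the dichotomous RM (the PSS items are polytomous, so this is a simplification) and hypothetical item difficulties `b`; the function names are illustrative and are not part of the original analysis. It suggests that, once the total score is fixed, the probability of a response pattern no longer depends on the person parameter θ.

```python
import itertools
import math


def pattern_prob(pattern, theta, b):
    """Probability of a 0/1 response pattern under the dichotomous Rasch model."""
    prob = 1.0
    for x, b_i in zip(pattern, b):
        p_i = 1.0 / (1.0 + math.exp(-(theta - b_i)))  # P(X_i = 1 | theta)
        prob *= p_i if x == 1 else 1.0 - p_i
    return prob


def conditional_pattern_prob(pattern, theta, b):
    """Probability of a pattern given its total score; theta cancels out."""
    score = sum(pattern)
    same_score = [q for q in itertools.product((0, 1), repeat=len(b))
                  if sum(q) == score]
    denominator = sum(pattern_prob(q, theta, b) for q in same_score)
    return pattern_prob(pattern, theta, b) / denominator


b = [-0.5, 0.0, 1.0]  # hypothetical item difficulties (assumption)
low = conditional_pattern_prob((1, 0, 1), theta=-1.0, b=b)
high = conditional_pattern_prob((1, 0, 1), theta=2.0, b=b)
print(low, high)  # the two values agree up to floating-point error
```

The unconditional pattern probabilities differ markedly between the two θ values, whereas the conditional ones coincide, which is what makes conditional inference on item parameters possible.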

A mathematical property of the RM is the conditional independence of item responses to exogenous variables (i.e. absence of DIF) and to other items (i.e. local independence). However, in most rating scales applied in health sciences, items often show evidence of LD and DIF. Therefore, items with LD or DIF do not fit the RM [58] and a common practice has been the deletion of items solely to obtain statistical fit to the model [59, 60]. This practice is problematic; if the deleted items cover important aspects of the construct, there is a threat to content validity [61] that can lead to “construct underrepresentation” [62]. In addition, the revised scale might end up being composed of a small number of items, leading to reduced reliability [58].

For this reason, recent methodological advances consist of analysis by Graphical Loglinear Rasch Model (GLLRM), which extends the RM with additional parameters to incorporate uniform LD and uniform DIF [60]. The term uniform refers to when the magnitude of the conditional dependence between items (LD) or between an item and an exogenous variable (DIF) is constant across the trait level. GLLRM is a combination of two independently developed statistical methods. The first method is the log-linear IRT models developed by Kelderman [63, 64], which generalize IRT models to relax the assumption of local independence. The assumption of local independence is restrictive and frequently not achieved by questionnaires in health sciences. Therefore, log-linear IRT models allow locally dependent items, while representing traditional IRT models with locally independent items (e.g. Partial Credit model) as a special case. The second method is the development of Graphical models [65], which graphically represent the structure of conditional dependence between variables. Since in the RM the total score is a sufficient statistic for the person parameter, graphical models are suitable for the analysis of LD and DIF. For example, to evaluate DIF, items and exogenous variables should be conditionally independent given the total score. The structure of conditional dependence between items, latent trait and exogenous variables can then be represented graphically.

The functional form of a general GLLRM (containing one LD and one DIF parameter) can be expressed as:

$$ \ln P\left(Y=\left({y}_1,\dots, {y}_k\right)\mid \theta, C\right)={\lambda}_0\left(\theta, x\right)+\sum \limits_i\left({\alpha}_{y_i}^i+{y}_i\theta \right)+\sum \limits_{i,j}{\lambda}_{y_i{y}_j}^{i,j}+\sum \limits_{i,j}{\delta}_{y_i{c}_j}^{i,j} $$

which describes the conditional distribution of a vector of item responses (y₁, …,y_k) given the latent trait θ and exogenous variables C. The terms $ {\lambda}_0\left(\theta, x\right)+\sum \limits_i\left({\alpha}_{y_i}^i+{y}_i\theta \right) $ are equivalent to the RM for polytomous items (i.e. Partial Credit model), while $ {\lambda}_{y_i{y}_j}^{i,j} $ represents the interaction parameter between item i and item j and $ {\delta}_{y_i{c}_j}^{i,j} $ represents the interaction parameter between item i and exogenous variable j. For an in-depth technical discussion of GLLRMs, please see [59].
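
As a worked special case (a sketch using the symbols above, with the added convention $ {\alpha}_0^i=0 $ and with $ m_i $ denoting the highest response category of item i, both of which are notation introduced here for illustration), setting all interaction parameters $ {\lambda}^{i,j} $ and $ {\delta}^{i,j} $ to zero leaves only the first two terms. The items are then independent given θ, each item follows the Partial Credit model, and $ {\lambda}_0 $ reduces to the normalizing constant:

$$ P\left({Y}_i={y}_i\mid \theta \right)=\frac{\exp \left({\alpha}_{y_i}^i+{y}_i\theta \right)}{\sum \limits_{h=0}^{m_i}\exp \left({\alpha}_h^i+h\theta \right)},\kern1em {\lambda}_0\left(\theta \right)=-\sum \limits_i\ln \sum \limits_{h=0}^{m_i}\exp \left({\alpha}_h^i+h\theta \right) $$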

The usefulness of GLLRM is that, when questionnaires exhibit uniform LD and uniform DIF, departures from the RM do not necessarily imply that items are flawed: locally dependent items convey less information than independent items and lead to reduced reliability; items with DIF require scores to be adjusted to allow comparison between subgroups. However, in both cases, the item serves its original purpose of measuring the latent trait, and retaining these items is important to preserve construct validity. Furthermore, in both cases, the distinctive feature of the RM is preserved: if the uniform LD parameter is included the sufficiency of the total score is retained; while, if the uniform DIF parameter is present, the sufficiency of the total score is retained within the DIF-defined subgroups [59]. Finally, the uniform LD and DIF parameters can inform how items deviated from ideal measurement requirements and become a starting point for modifications at the instrument level [58]. This approach aims to investigate why items did not fit the RM; and when departures consist of uniform LD and uniform DIF, it is possible to retain the items and inform future modifications to the instrument [58].

Statistical analysis

Item analysis

Item analysis was conducted with the following steps: (1) initially testing if the items would fit the RM [66]; (2) if fit to the RM was rejected, the departures were investigated and catalogued; and (3) in case of uniform LD and uniform DIF, the fit to a GLLRM adjusting for these departures was tested. In case of other types of departures, such as items displaying evidence of being a poor measure of the construct, the most problematic item was removed and the three previous steps repeated. The estimation method for the RM and GLLRM was conditional maximum likelihood [55]. Person parameters were estimated using weighted maximum likelihood (WML) [67]. Since missing values for individual items ranged from 0.0 to 1.3%, multiple imputation was not required [68]. All statistical analyses were conducted with DIGRAM v4.05 [69, 70]. Descriptive statistics and graphs were created with R software [71]. The item analysis included the evaluation of: a) model fit; b) global DIF; c) item fit; d) LD; e) DIF; and f) unidimensionality. After a measurement model was established, g) reliability and h) targeting of the instrument in this sample were evaluated.

Model fit and global DIF

Overall fit of the model was evaluated through the Conditional Likelihood Ratio (CLR) test [72]. The CLR test evaluates if item parameters are invariant between subsamples. One distinctive feature of items fitting a RM (and GLLRMs, see [59]) is that, within a specific frame of reference (e.g. Australian general population) [57], the functioning of the instrument (e.g. the difficulty of the items) is independent of the sample in which the instrument has been applied. Hence, if items do fit a RM/GLLRM, it is possible to divide the study sample according to a chosen criterion (e.g. lower and higher scores) and item parameters should remain the same in both subsamples. For this reason, the CLR test is a fit statistic to evaluate overall fit to the RM [69]. Moreover, when the sample is divided according to criteria based on exogenous variables (e.g. smokers/non-smokers, men/women) and item parameters are found not to be invariant, the CLR test provides evidence of (Global) DIF. In our study, the subsamples were defined according to lower and higher scores (i.e. homogeneity) to evaluate overall model fit; and according to sex (Male; Female) and education (education level up to High School; Technical education^{Footnote 1} or University) to evaluate Global DIF [54].
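
The logic of the CLR test can be sketched as follows, assuming the maximised conditional log-likelihoods have already been obtained from fitting the model to the full sample and to each subsample. The function name and the log-likelihood values are hypothetical and serve only as an illustration; the analysis in this study was run in DIGRAM, not with this code. The statistic is referred to a χ² distribution with the returned degrees of freedom.

```python
def clr_statistic(loglik_total, loglik_groups, n_item_params):
    """Conditional likelihood ratio statistic and its degrees of freedom.

    loglik_total:  maximised conditional log-likelihood in the full sample.
    loglik_groups: maximised conditional log-likelihoods, one per subsample.
    n_item_params: number of free item parameters estimated in each fit.
    """
    statistic = 2.0 * (sum(loglik_groups) - loglik_total)
    df = (len(loglik_groups) - 1) * n_item_params
    return statistic, df


# Hypothetical values for a seven-item, five-category subscale split into
# two score groups: 7 items x 4 thresholds - 1 constraint = 27 parameters.
stat, df = clr_statistic(-5230.4, [-2601.1, -2617.6], n_item_params=27)
print(round(stat, 1), df)
```

Under this parameter count, two subsamples give 27 degrees of freedom for a seven-item subscale and 55 for the full 14-item scale, which appears consistent with the degrees of freedom reported for the pure RM in the abstract.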

Item fit

The investigation of fit at an item level evaluates whether the observed responses to a specific item are in accordance with the responses predicted by the RM/GLLRM model. Fit of individual items was evaluated by conditional infit and outfit statistics, which, differently from traditional infit and outfit statistics, have a known sampling distribution [74].
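
For orientation, the traditional (unconditional) infit and outfit mean squares can be sketched as below. This is a simplified illustration for a single dichotomous item with person parameters treated as known; it is not the conditional version used in this study, and the function names and data are hypothetical. Values near 1 indicate agreement with the model, while values above 1 indicate responses that are noisier than expected.

```python
import math


def rasch_p(theta, b):
    """P(X = 1 | theta) for a dichotomous Rasch item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def infit_outfit(responses, thetas, b):
    """Traditional infit and outfit mean squares for one dichotomous item."""
    sq_resid, variances = [], []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        sq_resid.append((x - p) ** 2)      # squared residual
        variances.append(p * (1.0 - p))    # model variance of the response
    # Outfit: unweighted mean of squared standardised residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(responses)
    # Infit: information-weighted mean square.
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit


# Hypothetical data: two persons respond against model expectations.
infit, outfit = infit_outfit(responses=[0, 1], thetas=[2.0, -2.0], b=0.0)
print(infit, outfit)  # both well above 1
```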

LD and DIF

To investigate LD and/or DIF, Kelderman’s [64] likelihood ratio (LR) test was conducted to test if the additional uniform LD or uniform DIF parameter would better explain the item responses compared to the fitted model. In addition, the magnitude of the uniform LD or uniform^{Footnote 2} DIF was evaluated through the partial Goodman-Kruskal [75] γ rank correlation between items given the two restscores or between item and exogenous variable given the total score [76]. In case DIF was present, the scores were adjusted and conversion tables reported [59]. When multiple tests were performed, the Benjamini-Hochberg [77] procedure was conducted to adjust for false discovery rate (FDR).
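
The Benjamini-Hochberg step-up procedure mentioned above can be sketched in a few lines. This is a generic illustration with hypothetical p-values and an assumed FDR level of 0.05, not the DIGRAM implementation: the p-values are sorted, the largest rank k with p(k) ≤ (k/m)·q is located, and the k smallest p-values are declared significant.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank  # keep the largest rank meeting the criterion
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from four LD/DIF tests.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

Because the procedure is step-up, a p-value that misses its own threshold is still rejected when a larger p-value meets a later threshold.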

Dimensionality

Initially, the RM and subsequent GLLRMs were tested for the PSS-14 containing all items. In case no fit was found, we then proceeded to test the two subscales composed of negatively and positively worded items. Finally, if a RM or GLLRM was found for each subscale, a formal test of unidimensionality was conducted by comparing the observed γ correlation of the subscales with the expected γ correlation of the subscales under the unidimensional model. The rationale is that the correlation between two subscales measuring different traits is weaker than the expected correlation of subscales measuring the same trait [78]. Negatively worded items (from the “Perceived Stress” subscale) were reverse scored in the dimensionality analysis. Markov graphs [59] were reported to illustrate the final models.
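
The γ coefficient underlying this comparison can be sketched as follows. The code computes the plain (marginal) Goodman-Kruskal γ between two ordinal score vectors; it does not compute the partial γ or the expected γ under the unidimensional model, which were obtained from DIGRAM. The function name and the scores are hypothetical.

```python
def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma: (C - D) / (C + D) over all untied pairs."""
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
            # pairs tied on x or y are ignored
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical subscale total scores for five respondents.
stress = [12, 15, 9, 20, 17]
control = [10, 11, 8, 16, 13]
print(goodman_kruskal_gamma(stress, control))
```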

Reliability

In case of fit to the RM, reliability was estimated using Cronbach’s α [51], since it provides a lower-bound estimate of reliability [52] when items are locally independent. However, when LD was found, a Monte Carlo simulation method [79] that adjusts for the LD between items was applied. Since DIF implies that the item thresholds (and, consequently, the item difficulty) change according to subgroup, the different item parameters influence the true score distribution so reliability was calculated for each subgroup independently [80]. In addition, the person separation probability was calculated, which is the probability that the total scores rank two random persons in the same way as the true value of their latent trait (i.e. rather than the estimates).
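
For the locally independent case, Cronbach’s α can be computed directly from the item and total-score variances. The sketch below uses only the Python standard library and a small hypothetical data set; it does not cover the Monte Carlo adjustment for LD described above.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores is a list of respondents' item-score lists."""
    k = len(item_scores[0])  # number of items
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of three persons to two items.
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
print(alpha)
```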

Targeting

Targeting was evaluated through the Test Target Information Index, which consists of the mean test information divided by the maximum obtained test information. In addition, targeting was evaluated graphically through the inspection of item maps.

Criterion validity

Since scores are ordinal, the convergent and divergent validity of the PSS with other psychological constructs pertaining to its nomological network [34] was evaluated by calculating the non-parametric Kendall’s τ [81]. For this analysis, the complementary measures were used. A negative correlation of Perceived Stress with FA, FR, SO and RD, and a positive correlation with EF and OC was anticipated. In addition, known-groups validity [82] was assessed and it was expected that women would have higher scores on the Perceived Stress subscale [14] but no difference in scores on the Perceived Control subscale [22, 33, 47]. It was also expected that participants with less education would have higher scores on the Perceived Stress subscale [14].

Results

The demographic characteristics of the sample are found in Table 1. Participants’ ages ranged from 18 to 82 years (M = 50.2, SD = 14.8). The majority of participants were women (61.9%), had a tertiary education (67.5%) and were employed (59%).

Table 1 Characteristics of the study participants

PSS-14

Fit of the PSS-14 to the RM was rejected (Table 2).

Table 2 Conditional likelihood ratio test of overall model fit and Global DIF

The results indicated item misfit (Supplementary Table 2 - Additional file 1) among the majority of items. The analysis proceeded by sequentially excluding the items that displayed the highest misfit (e.g. items 4, 5, 9, 12, 13, and 6), while investigating departures in terms of LD and DIF with GLLRMs. However, it became clear that: a) LD and DIF could not explain the misfit to the RM and GLLRMs were not found; and b) the majority of excluded items were negatively worded, indicating that they would not form a unidimensional scale together with the positively worded items. At this point, we proceeded to the analysis of the subscales.

Perceived stress subscale

Fit of the negatively worded items (“Perceived Stress”) subscale to the RM was rejected (Table 2). The investigation of item fit statistics (Supplementary Table 3 - Additional file 1) indicated strong misfit of Item 12 ( “… found yourself thinking about all the things you have to accomplish?”) (Infit = 1.675, SE = 0.023, p < 0.001; Outfit = 1.669, SE = 0.023, p < 0.001) (Fig. 1).

Figure 1 shows that the average observed scores exhibited a pattern of under discrimination since they formed a flat curve compared to the model expectations, indicating that item responses were less influenced by the latent trait (“perceived stress”). It was then evaluated whether Item 12 misfit could be a result of DIF or LD (i.e. although LD often results in over discrimination) but a GLLRM was not found. For these reasons, Item 12 was excluded.

After the deletion of Item 12, the CLR test rejected fit to the RM (χ² (23) = 312.9, p < 0.001) and the next item that displayed misfit was Item 8 ( “… felt unable to cope with all the things that you had to do?”) (Infit = 1.145, SE = 0.023, p < 0.001; Outfit = 1.155, SE = 0.023, p < 0.001). The analysis indicated that Item 8 misfit was also not a result of LD or DIF and Item 8 was also excluded.

GLLRM of the perceived stress subscale

After exclusion of the two items, the CLR test rejected fit to the RM but fit to a GLLRM was found (χ² (96) = 94.4, p = 0.440) (Table 2) (Fig. 2).

LD was found between Item 1 ( “… felt upset because of something that happened unexpectedly?”) and Item 2 ( “… felt unable to control the important things in your life?”) (γ_avg = 0.18). DIF was found between Item 1 and sex (γ = 0.24); between Item 3 ( “… felt either nervous or stressed?”) and sex (γ = 0.33); and between Item 1 and education (γ = − 0.14). There was no item misfit (Table 3), and the Kelderman’s LR test indicated no further evidence of DIF or LD (Supplementary Table 4 - Additional file 1).

Table 3 Item fit statistics for the GLLRM of the negatively worded items (“Perceived Stress”)

Considering that the GLLRM had overall model fit and there was no further evidence of global DIF, item misfit, DIF or LD, the measurement model for the “Perceived Stress” subscale was established.

Perceived control subscale

Fit of the positively worded items (“Perceived Control”) subscale to the RM was rejected (Table 2). Misfit was found among the majority of items (Supplementary Table 5 - Additional file 1). The item with the highest misfit was Item 9 ( “… felt able to control irritations in your life?”) (Infit = 1.367, SE = 0.026, p < 0.001; Outfit = 1.237, SE = 0.023, p < 0.001) and it was excluded. On the subsequent analysis, substantial misfit was also found regarding Item 13 ( “… felt able to control the way you spend your time?”) (Infit = 1.363, SE = 0.036, p < 0.001; Outfit = 1.180, SE = 0.032, p < 0.001), Item 4 ( “… dealt successfully with irritating life hassles?”) (Infit = 1.226, SE = 0.024, p < 0.001; Outfit = 1.185, SE = 0.024, p < 0.001) and Item 5 (“...effectively coped with important changes in your life?”) (Infit = 1.571, SE = 0.024, p < 0.001; Outfit = 1.501, SE = 0.024, p < 0.001) and these items were removed.

GLLRM of the perceived control subscale

After the exclusion of the misfitting items, the CLR test indicated fit to a GLLRM (χ² (55) = 62.5, p = 0.224) (Table 2) (Fig. 2). LD was found between Item 7 ( “… felt things were going your way?”) and Item 10 ( “… felt you were on top of things?”) (γ_avg = 0.22). DIF was found between Item 10 and sex (γ = − 0.23); between Item 6 (“...felt confident about your ability to handle your personal problems?”) and sex (γ = − 0.15); and between Item 10 and education (γ = − 0.17). There was no further evidence of item misfit (Supplementary Table 6 - Additional file 1) or LD/DIF (Supplementary Table 7 - Additional file 1). Considering that the GLLRM had overall model fit and there was no further evidence of global DIF, item misfit, LD or DIF, the measurement model for the “Perceived Control” subscale was established.

The table for adjusting scores after accounting for DIF is provided for both subscales (Supplementary Table 8 - Additional file 1).