### The Rasch model

The previously mentioned requirements of invariance for measurement are basically requirements of the data. The Danish mathematician Georg Rasch formalised these measurement requirements of the data in a mathematical model which is unidimensional and probabilistic [6]. Since invariance is an integral property of the Rasch model, any test of the fit between the data and the model is a test of the extent to which the data show invariant properties with respect to the criterion of invariance tested, i.e. whether an instrument works invariantly across individuals or across sample groups, depending on which test of invariance is applied.

The Rasch model can be used for the analysis of dichotomous [6] as well as polytomous data [7]. In principle there are only two kinds of parameters to be estimated in the Rasch model: item and person parameters, which enter into the model additively. The Rasch model enables these parameters to be estimated independently of each other, in accordance with the requirements for measurement stated by Rasch [6, 8,9,10] and Thurstone [11]. The estimated parameters, which take the form of person and item location values, are placed on a common logit scale where the location of the items relative to the persons becomes apparent. This also enables examination of the operating characteristics of the items along the whole continuum of a latent trait using Expected Value Curves (EVCs). These curves predict the item scores as a function of the item parameters and person locations on the latent trait. Ideally, the observed means of persons in adjacent class intervals should fit closely to the expected values of the curve. Misfit between the observed means and the EVC, which is a manifestation of lack of invariance across the variable, may appear as either under- or over-discrimination of an item relative to the EVC. Such misfit will, to a greater or lesser extent, affect comparisons between persons along the latent variable.
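As a minimal sketch, the EVC of a dichotomous Rasch item can be computed directly from the model: the expected score at a given person location is the model probability of a positive response. The item location used here is hypothetical.

```python
import math

def rasch_prob(theta, delta):
    """Probability of a positive response in the dichotomous Rasch model:
    P(x = 1) = exp(theta - delta) / (1 + exp(theta - delta)),
    where theta is the person location and delta the item location (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# For a dichotomous item, the EVC is simply E[x] = P(x = 1) as a function
# of theta; here evaluated for a hypothetical item located at delta = 0.5.
evc = [(theta, rasch_prob(theta, 0.5)) for theta in (-2, -1, 0, 0.5, 1, 2)]
```

At a person location equal to the item location the expected score is 0.5, and the curve increases monotonically along the latent trait.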

### DIF with reference to the expected value curve

DIF relative to the EVC may also be examined with respect to sample groups, e.g. gender. Thus if only one EVC is required to predict the item scores irrespective of groups, then there is no DIF; on the other hand, if separate EVCs are required for an item, one for each group, then the item shows evidence of DIF. If the DIF is the same along the latent trait implying parallel EVCs then DIF is referred to as uniform; if the DIF varies along the latent trait implying non-parallel EVCs, DIF is referred to as non-uniform [12].
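The distinction can be illustrated numerically (all values below are hypothetical). Uniform DIF is depicted by group-specific item locations with equal slopes, giving parallel curves on the logit scale; non-uniform DIF is depicted by curves that cross. Note that a slope parameter other than one is itself a departure from the Rasch model and is used here only to portray non-parallel EVCs.

```python
import math

def expected_score(theta, delta, slope=1.0):
    """Expected score of a dichotomous item. A slope other than 1 departs
    from the Rasch model and is used here only to depict non-uniform DIF."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - delta)))

# Uniform DIF: one EVC per group, shifted but parallel (equal slopes).
girls_uniform = [expected_score(t, 0.0) for t in (-1, 0, 1)]
boys_uniform  = [expected_score(t, 0.5) for t in (-1, 0, 1)]

# Non-uniform DIF: the group curves cross (unequal slopes).
girls_nonuni = [expected_score(t, 0.0, slope=1.2) for t in (-1, 0, 1)]
boys_nonuni  = [expected_score(t, 0.0, slope=0.8) for t in (-1, 0, 1)]
```

In the uniform case one group has the higher expected score at every location; in the non-uniform case the direction of the difference reverses along the trait.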

Recent work on DIF has demonstrated that a distinction also has to be made between *real* and *artificial* DIF [12, 13]. Real DIF is inherent to an item and affects the person measures, while artificial DIF does not. Artificial DIF is an artefact of the procedure for identifying DIF and is common to most procedures for identifying DIF [12, 13], including the popular Mantel–Haenszel (MH) procedure. Failure to distinguish between real and artificial DIF may affect person measurement.

### Causes and determinants of artificial DIF

There is no DIF if the observed means are the same for persons from different sample groups given the same *location* on the latent variable. The person locations are, however, not generally known in advance but are estimated as a part of the procedure to identify DIF. Hence, the unknown person locations are substituted by their estimates. This substitution is the source of artificial DIF with most procedures for detecting DIF, including the MH procedure.

Given that grouping persons by total scores in the Rasch model is equivalent to grouping persons according to their estimates, Andrich and Hagquist [13] further explained the source of artificial DIF:

*“Grouping persons by the estimate provides a constraint on the sum of the estimated probabilities (and proportions) of a positive response across all items, given the same total score. Thus the sum of the probabilities, or proportions, of positive responses across items of persons with a total score of r must be r. Therefore, if because of real DIF in one item favoring one group the probability (or proportion) is greater in that group, artificial DIF which favors the other group must be induced in the other items.”* (p. 413)

Although this was written with reference to dichotomous data, the same principles hold also for polytomous data and both uniform and non-uniform DIF. Because real DIF in one item is distributed as artificial DIF across all other items, the magnitude of artificial DIF is determined by the number of items with real DIF, the magnitude of real DIF, the direction of the DIF, the total number of items and the location of the items relative to the distribution of the persons [12,13,14].
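The constraint described in the quotation can be demonstrated with a small numerical sketch (all proportions hypothetical): because the proportions of positive responses across items must sum to the total score r for persons with that score, inflating one item's proportion in one group necessarily deflates the proportions of the other items for that group.

```python
# Persons with the same total score r on k items: the proportions of
# positive responses across the k items must sum to r. If real DIF in
# item 1 raises its proportion in one group, the surplus must be absorbed
# by the remaining items, appearing as artificial DIF favouring the
# other group on those items.

k, r = 8, 3  # hypothetical: 8 dichotomous items, total score 3

# Reference group: proportions summing to r (uniform for simplicity).
ref = [r / k] * k

# Focal group: real DIF inflates item 1 by 0.10; the other k - 1 items
# must jointly absorb the surplus so the sum stays fixed at r.
focal = [ref[0] + 0.10] + [ref[i] - 0.10 / (k - 1) for i in range(1, k)]
```

Both lists sum to r, but every item other than the first now shows a (purely artificial) difference between the groups.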

The location of the items relative to the distribution of the persons has no impact on the direction of uniform DIF (e.g. favouring one group or the other), whereas non-uniform DIF is affected [14].

Neither for uniform nor for non-uniform DIF does artificial DIF balance out real DIF with respect to group differences in the person estimates. However, the effects of real DIF on person measurement are more pronounced for uniform DIF than for non-uniform DIF [14].

### Data

Swedish data from the Health Behaviour in School-aged Children (HBSC) study were used. The HBSC study has been conducted in collaboration with the World Health Organisation since the 1980s. The HBSC study includes students in grades 5, 7 and 9 [15]. Data were collected with questionnaires which were completed anonymously in school classrooms. Participation was voluntary. In the present study only data from 11,068 grade 9 students are used, collected at seven points in time during the period 1985–2014.

### Instrument

A composite measure of psychosomatic problems was constructed by summation of the responses to eight questions about headache, stomach ache, backache, feeling low, irritability or bad tempered, feeling nervous, difficulties in getting to sleep and feeling dizzy.

The response categories for all of these eight items, which are in the form of questions, are ‘About every day’, ‘More than once a week’, ‘About once a week’, ‘About once a month’ and ‘Seldom or never’. The categories are ordered in terms of implied frequency: the higher the frequency, the higher the degree of psychosomatic problems.
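A minimal scoring sketch, assuming the five ordered categories are scored 0–4 in order of increasing frequency (the actual numeric coding is not specified in the text):

```python
# Hypothetical scoring of the eight HBSC items: the five frequency
# categories are scored 0-4 so that higher frequency (more problems)
# receives a higher score; the composite is the sum over the items.
SCORES = {
    'About every day': 4,
    'More than once a week': 3,
    'About once a week': 2,
    'About once a month': 1,
    'Seldom or never': 0,
}

def composite_score(responses):
    """Sum the category scores of the eight item responses."""
    return sum(SCORES[r] for r in responses)

# A respondent reporting weekly headaches and otherwise no problems:
example = ['About once a week'] + ['Seldom or never'] * 7
```

Under this coding the composite ranges from 0 (no reported problems) to 32 (all eight problems about every day).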

### DIF-analysis using ANOVA of residuals

The DIF-analysis was conducted using the polytomous Rasch model [16]. Because the data are used for illustrative purposes, only gender DIF is analysed, although DIF across time as well as other violations of the Rasch model may also occur.

To hypothesise real DIF items, in the present paper we use a two-way analysis of variance (ANOVA) of residuals given the Rasch model item and person parameter estimates, where one factor comprises class intervals along the variable and the other the designated groups [17]. Because the ANOVA estimates and separates main and interaction effects, the procedure allows simultaneous testing of uniform as well as non-uniform DIF among a priori specified sample groups. In addition, the ANOVA generates an overall test of item fit along the continuum irrespective of the defined groups (e.g. gender), based on adjacent class intervals of approximately equal size. In that respect the ANOVA comprises an all-in-one procedure for simultaneously identifying possible real DIF among groups and possible DIF along the latent trait. This contrasts with commonly used two-step procedures based on logistic regression, in which the fit of the items along the continuum is examined separately, irrespective of groups and with different software, before the person measures are included in the logistic regression analysis [18].

The ANOVA analyses the standardised residuals of the responses from the estimated EVC. The F-values calculated in the ANOVA rank the items according to the magnitude of DIF.

The standardised residual \( {z}_{ni} \) of each person (n) to each item (i) is given by

$$ {z}_{ni}=\frac{x_{ni}-E\left[{x}_{ni}\right]}{\sqrt{V\left[{x}_{ni}\right]}}. $$

For the purpose of a detailed analysis, each person is identified by the gender group (g), and by the class interval (c). This gives the residual \( {z}_{n_{cg}i} \)

$$ {z}_{n_{cg}i}=\frac{x_{n_{cg}i}-E\left[{x}_{n_{cg}i}\right]}{\sqrt{V\left[{x}_{n_{cg}i}\right]}}. $$

The ANOVA determines whether there is a main gender effect, a class interval effect, or an interaction between the class interval and gender.
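The residual formula and the two-way layout can be sketched as follows for a single dichotomous item with simulated data (the item location, class-interval boundaries, group labels and sample size are all hypothetical; the actual analysis used the polytomous model in RUMM2030):

```python
import math
import random

def rasch_prob(theta, delta):
    """P(x = 1) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def std_residual(x, theta, delta):
    """z_ni = (x_ni - E[x_ni]) / sqrt(V[x_ni]) for a dichotomous item,
    where E[x] = P(x = 1) and V[x] = P(1 - P)."""
    p = rasch_prob(theta, delta)
    return (x - p) / math.sqrt(p * (1.0 - p))

random.seed(1)
delta = 0.0  # hypothetical item location
# Simulate 600 persons: location theta ~ N(0, 1), gender 'g' or 'b',
# and a response drawn from the Rasch model (so no DIF is present).
data = []
for _ in range(600):
    theta = random.gauss(0, 1)
    gender = random.choice('gb')
    x = int(random.random() < rasch_prob(theta, delta))
    data.append((theta, gender, x))

# Two-way layout for the ANOVA: class intervals along the variable
# crossed with gender. The ANOVA tests a gender main effect (uniform
# DIF), a class-interval effect (item misfit along the continuum) and
# their interaction (non-uniform DIF).
cells = {}
for theta, gender, x in data:
    c = min(2, max(0, int(theta + 1.5)))  # three crude class intervals
    cells.setdefault((c, gender), []).append(std_residual(x, theta, delta))

cell_means = {k: sum(v) / len(v) for k, v in cells.items()}
```

Because the responses are generated from the model itself, the cell means of the standardised residuals hover around zero; systematic gender differences in these means are what the ANOVA would flag as DIF.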

The DIF-analyses were conducted with the sample size adjusted to a value of the order of 960, and with a Bonferroni adjustment [19] of the significance values applied for a Type I error level of 0.05.

The Rasch analysis was performed with the software RUMM2030 [20].

### Resolving items and quantifying DIF

Because the F-values in the ANOVA of DIF provide only a relative ordering of the magnitude of DIF, a complementary approach is required to establish quantitative values of DIF. This can be achieved by resolving an item identified as having potential real DIF into multiple items, one for each group, and comparing the estimates of the item parameters across the groups. When an item is resolved, responses for all groups except the designated group become structurally missing. To estimate the parameters in the presence of structurally missing responses, where not all persons respond to all items, principles of test equating [4] are applied; such missingness is not an impediment in most software used to analyse responses with Rasch models. Although some persons have not responded to all items, if the items work invariantly, comparable estimates of item and person locations on a common logit scale are obtained.

Testing the differences between item location values and slope values provides a measure of the magnitude of uniform and non-uniform DIF respectively. Because real DIF in one item induces artificial DIF in all other items, DIF has to be resolved sequentially, item by item, starting with the item showing the largest DIF. After resolution, real DIF in an item no longer generates artificial DIF in other items [13].
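The restructuring of the data when an item is resolved can be sketched as follows (item names, group labels and scores are hypothetical): the resolved item's responses are retained for the designated group and set structurally missing for the other group.

```python
# Resolving a DIF item: the original item column is replaced by one
# column per group, with responses structurally missing (None) for
# persons outside the designated group.

def resolve_item(records, item, groups):
    """records: list of dicts with a 'group' key and item scores.
    Returns records where `item` is split into `item_<g>` per group."""
    out = []
    for rec in records:
        new = {k: v for k, v in rec.items() if k != item}
        for g in groups:
            new[f'{item}_{g}'] = rec[item] if rec['group'] == g else None
        out.append(new)
    return out

records = [
    {'group': 'girl', 'headache': 3, 'backache': 1},
    {'group': 'boy',  'headache': 2, 'backache': 0},
]
resolved = resolve_item(records, 'headache', ['girl', 'boy'])
```

Each resolved column then receives its own item parameter estimate, and the difference between the group-specific estimates quantifies the DIF.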

In Fig. 1 the sequential procedure for detecting and resolving items showing evidence of DIF is shown.

While removing an item will decrease the reliability and person separation, resolving an item has only a very small effect, if any, on these properties. It is therefore usually preferable to resolve an item rather than remove it. However, because resolving an item, like removing one, may affect validity, resolving DIF is only justified if the DIF can be shown to arise from some source irrelevant to the variable of assessment and therefore deemed dispensable. This will be discussed further at the end of the paper.

Although items showing evidence of DIF should not be resolved without external information about the source of the DIF, in the present analyses we are sequentially resolving items only based on statistical misfit in order to illustrate the impact of artificial DIF.