Data source
Data were from the Winnipeg Regional Health Authority Joint Replacement Registry; the Health Authority is the largest health region in the province of Manitoba, Canada and has a population of more than 700,000 residents. The province has a single-payer health care system that provides necessary hospital, medical and surgical services to all individuals eligible to receive health services. The Registry captures more than 90% of the joint replacement procedures conducted within the health region and more than three-quarters of all procedures in the entire province.
The Registry was initiated in 2004 with partial capture of information on all joint replacement surgeries; this was expanded to full mandatory capture of information in 2005. The Registry has been described in detail elsewhere [18]; it contains patient demographics, comorbid conditions, surgical technique, implant details, and complications. Both general and condition-specific HRQoL measures are included in the Registry. The former includes the SF-12 and the latter includes the Oxford Hip and Knee scores [19, 20]. Pre-operative data capture occurs in the pre-admission clinic under the guidance of a clinic nurse. Post-operative data are collected via a mail-out questionnaire conducted by Registry staff. Data entry is undertaken by the hospital medical records department for hospital stay characteristics and by Registry staff for PROMs. All data are collected via standardized instruments and the process of data collection and entry is overseen by Registry staff for all hospital sites.
The study cohort included all individuals who underwent THA or TKA between April 1, 2009, and March 31, 2015 and for whom complete pre-operative data were available. All patients from one hospital were excluded in 2011 because pre-operative questionnaires were not distributed that year.
Measures
The SF-12 (version 2) is a general-purpose instrument consisting of 12 items that comprise eight sub-domains [21]: physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional, and mental health. The eight sub-domain scores can be weighted and summarized into MH and PH sub-scale scores. According to this model, the items from the physical functioning, role-physical, bodily pain, and general health sub-domains are indicators of PH, while the vitality, social functioning, role-emotional, and mental health items are indicators of MH. Assessments of construct validity using latent variable models have confirmed this measurement structure [21, 22], although correlations of residual errors for items associated with the PH and MH latent variables have been observed [21,22,23].
Covariates used to describe the study cohort and examine potential DIF sources for the SF-12 included sex, age group, body weight status, and multimorbidity, the presence of two or more chronic conditions [24]. Age was classified as 60 years or less (reference category), 61 to 70 years, and greater than 70 years; the dummy variables AGE1 (0 if age ≤ 60 and 1 otherwise) and AGE2 (0 if age ≤ 70 and 1 otherwise) were created to represent these age categories. Body weight status was based on body mass index (BMI), which was calculated from measured height and weight (kg/m2) captured by clinic nurses; it was categorized as underweight or normal weight (BMI ≤ 25.0; reference category), overweight (25.0 < BMI ≤ 30.0), and obese (BMI > 30.0) [25, 26]. The dummy variables of BMI1 (0 if BMI ≤ 25.0 and 1 otherwise) and BMI2 (0 if BMI ≤ 30.0 and 1 otherwise) were created to represent these categories.
Information about 14 chronic conditions was captured from a self-report questionnaire administered by clinic staff at the pre-operative occasion; individuals were classified as having multimorbidity if they had at least two of these chronic conditions. A single dummy variable COMORB (1 = presence of 2+ comorbid conditions and 0 otherwise) was defined.
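The dummy coding described above can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function name `code_covariates` and its argument names are hypothetical, and sex is omitted because its coding is not specified here.

```python
def code_covariates(age, bmi, n_conditions):
    """Return the dummy variables described in the text for one patient.

    Hypothetical helper: argument names are assumptions for illustration.
    """
    return {
        "AGE1": int(age > 60),    # 0 if age <= 60, 1 otherwise
        "AGE2": int(age > 70),    # 0 if age <= 70, 1 otherwise
        "BMI1": int(bmi > 25.0),  # 0 if BMI <= 25.0, 1 otherwise
        "BMI2": int(bmi > 30.0),  # 0 if BMI <= 30.0, 1 otherwise
        "COMORB": int(n_conditions >= 2),  # 1 if two or more conditions
    }


# A 66-year-old with BMI 31.2 and one chronic condition (made-up values).
example = code_covariates(age=66, bmi=31.2, n_conditions=1)
```

Under this coding the dummy variables are cumulative rather than mutually exclusive: a patient older than 70 years has both AGE1 = 1 and AGE2 = 1, and an obese patient has both BMI1 = 1 and BMI2 = 1.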
Statistical analysis
The analyses were conducted for patients with complete information (i.e., no missing data) on all SF-12 items. Descriptive analyses were conducted using frequencies and percentages. All analyses were stratified by joint type.
A variety of methods have been used to detect DIF, including logistic regression [27], item response theory (IRT) models [28, 29], and the multiple indicators multiple causes (MIMIC) model [30,31,32]. IRT and MIMIC models can be applied to binary and ordinal item responses and can accommodate one or more latent constructs. In addition, the MIMIC model allows dependencies between item residuals to be specified [23, 33]. Consequently, we adopted the MIMIC model to test for uniform DIF.
We constructed baseline models for the MH and PH sub-scales based on the hypothesized measurement structure of the SF-12, in which the PH and MH sub-scales have no cross-loading items (Additional file 1: Figures S1 and S2). The baseline models included two correlated residuals (items P2 and P3, P4 and P5) for the PH sub-scale and two correlated residuals (items M1 and M2, M3 and M5) for the MH sub-scale [21, 23]. These correlated residuals were confirmed by the assessment of fit measures, which demonstrated poorer overall fit when the residuals were not correlated.
In a MIMIC model with m items and k covariates, the latent response for the ith item (i = 1,…, m) is regressed on the latent variable F and the covariate vector Z,
$$ y_i^{\ast} = \lambda_i F + \boldsymbol{\beta}_i^{\prime}\mathbf{Z} + \varepsilon_i, $$
(1)
where εi is the error term, λi is the factor loading, and \( \boldsymbol{\beta}_i^{\prime} = \left(\beta_{i1}, \dots, \beta_{ik}\right) \) is the vector of the effects of covariates on the latent response \( y_i^{\ast} \). The latent response is scored via a threshold model
$$ y_i = c, \quad \text{if} \quad \tau_{i(c)} < y_i^{\ast} \le \tau_{i(c+1)}, $$
(2)
for categories c = 0, 1, 2, …, C – 1, where τi(0) = − ∞ and τi(C) = + ∞. Thus, yi is a polytomous variable which takes discrete values 0, 1, …, C – 1. In addition, the latent factor is regressed on the covariates via
$$ F = \boldsymbol{\gamma}^{\prime}\mathbf{Z} + \eta, $$
(3)
where η is the error term and is independent of Z, and γ′ = (γ1, … , γk) is a vector of regression coefficients that describe between-group differences in F (Fig. 1). These formulations enable us to estimate and test \( \boldsymbol{\beta}_i^{\prime} \) conditional on F. If \( \boldsymbol{\beta}_i^{\prime} \ne \mathbf{0} \), there is a significant direct effect from the covariates to the latent response \( y_i^{\ast} \), which means that DIF exists in the ith item [34, 35].
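To make Equations (1) to (3) concrete, the following Python sketch evaluates them for a single item and a single binary covariate. It is a toy illustration under stated assumptions, not the estimation procedure: the function names, the loading, the direct effect, and the thresholds are all invented, and the error terms are set to zero so that the result is deterministic.

```python
from bisect import bisect_left


def factor_value(gammas, z, eta):
    """Equation (3): F = gamma' Z + eta."""
    return sum(g * x for g, x in zip(gammas, z)) + eta


def latent_response(loading, factor, betas, z, error):
    """Equation (1): y* = lambda_i * F + beta_i' Z + epsilon_i."""
    direct_effect = sum(b * x for b, x in zip(betas, z))
    return loading * factor + direct_effect + error


def observed_category(y_star, thresholds):
    """Equation (2): y = c if tau_(c) < y* <= tau_(c+1).

    `thresholds` holds only the finite thresholds tau_(1), ..., tau_(C-1)
    in ascending order; tau_(0) = -inf and tau_(C) = +inf are implicit.
    """
    return bisect_left(thresholds, y_star)


# Made-up values: four response categories, one binary covariate.
thresholds = [-1.0, 0.0, 1.0]
loading, beta = 0.8, [0.7]

# Two respondents with the same latent level F = 0.5 but different covariate values.
y_ref = observed_category(latent_response(loading, 0.5, beta, [0], 0.0), thresholds)
y_focal = observed_category(latent_response(loading, 0.5, beta, [1], 0.0), thresholds)
```

In this toy case the two respondents share the same value of F but fall into different response categories because the direct effect is non-zero, which is the pattern that uniform DIF describes.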
There were four primary steps in the DIF analysis. First, unidimensionality of the measurement scales was assessed. Next, anchor items were selected. Then, each item was assessed for DIF. Finally, after adjustment for DIF, the contributions of the covariates and items to the final DIF model were assessed.
In the first step the unidimensionality assumption, which implies that all sub-scale items measure a single latent construct, was examined by applying a single-factor model with an oblique rotation to the polychoric correlation matrix for the items for each of the MH and PH sub-scales. To make a decision about unidimensionality, we used two criteria: (a) the existence of only one eigenvalue greater than one, and (b) a large value for the ratio of the first to second eigenvalues (i.e., r > 4) [36]. We used several criteria to evaluate the goodness-of-fit of a single-factor model. We considered the model to be a reasonable fit to the data if it had a small root mean square error of approximation (i.e., RMSEA < 0.06), a large comparative fit index (i.e., CFI > 0.95), a large Tucker-Lewis Index (i.e., TLI > 0.95), and a small weighted root mean square residual (i.e., WRMR < 1.0) [37,38,39].
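The decision rules in this step can be written as two small checks. The sketch below is illustrative only: the function names are hypothetical, the eigenvalues and fit statistics are assumed to have been obtained elsewhere (for example from the polychoric correlation matrix and the fitted model), and the numbers in the example are made up.

```python
def is_unidimensional(eigenvalues, min_ratio=4.0):
    """Apply criteria (a) and (b) to the eigenvalues of the correlation matrix.

    Assumes at least two eigenvalues and a positive second eigenvalue.
    """
    ev = sorted(eigenvalues, reverse=True)
    only_one_above_one = sum(e > 1.0 for e in ev) == 1  # criterion (a)
    large_ratio = ev[0] / ev[1] > min_ratio             # criterion (b)
    return only_one_above_one and large_ratio


def reasonable_fit(rmsea, cfi, tli, wrmr):
    """Check the four goodness-of-fit cut-offs listed in the text."""
    return rmsea < 0.06 and cfi > 0.95 and tli > 0.95 and wrmr < 1.0


# Hypothetical eigenvalues and fit indices for one sub-scale.
unidimensional = is_unidimensional([3.9, 0.7, 0.6, 0.5, 0.3])
acceptable = reasonable_fit(rmsea=0.05, cfi=0.97, tli=0.96, wrmr=0.9)
```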
In the second step, we selected anchor items (i.e., DIF-free items). At least one anchor item must be selected to define the latent construct on which the groups are compared. We used the following method to select the anchor item(s). First, for each sub-scale, a single-factor model was fit to the data; it included direct effects of the covariates on the latent variable but no direct effects between the covariates and the sub-scale items. This was the base model. Next, a series of single-factor models was fit to the data, each adding direct effects from the covariates to one sub-scale item; there was one model for each item. A χ2 difference test was used to compare the models with and without the direct effects. The item(s) with the smallest χ2 difference statistic(s) was (were) selected as the anchor item(s) [40]. Note that this process was applied to the data for all cohort members so that the same anchor items were selected for both THA and TKA patients. This facilitated the interpretation of the study findings because the same item(s) served as reference points for all analyses. We confirmed the same anchor items in separate analyses for THA and TKA patients.
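The selection rule for anchor items amounts to picking the item or items with the smallest χ2 difference statistic. A minimal sketch follows, assuming the statistics have already been produced by the model comparisons; the function name, the `tol` parameter for treating near-ties as ties, and the numeric values are all hypothetical.

```python
def select_anchor_items(chi2_by_item, tol=0.0):
    """Return the item(s) whose chi-square difference statistic is smallest.

    `tol` is an illustrative assumption: items within `tol` of the minimum
    are also returned, so more than one anchor item can be selected.
    """
    smallest = min(chi2_by_item.values())
    return sorted(
        item for item, chi2 in chi2_by_item.items() if chi2 - smallest <= tol
    )


# Made-up chi-square difference statistics for five items of one sub-scale.
chi2_stats = {"P1": 12.4, "P2": 3.1, "P3": 27.8, "P4": 9.6, "P5": 15.0}
anchors = select_anchor_items(chi2_stats)
```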
In the third step, item purification was conducted to identify the items affected by DIF. First, a full model was fit to the data that included direct effects from covariates to all sub-scale items except the anchor item(s). Next, we fit a series of reduced models that excluded direct effects from the covariates to each item; this was done one item at a time. A χ2 difference test was used to compare these nested models using DIFFTEST for the robust weighted least squares estimation method (i.e., WLSMV) in Mplus (https://www.statmodel.com/chidiff.shtml). A large χ2 difference statistic implies uniform DIF is present for the item.
The fourth step was to fit a model that included direct effects from the covariates to all DIF items (i.e., the items for which DIF was identified in the previous step) and direct effects of the covariates on the latent variable [9, 31]. This model was used to obtain parameter estimates of direct effects of the covariates on the PH and MH sub-scale items. The total effect of DIF was measured via the relative difference between standardized coefficient estimates for the DIF and No-DIF models (i.e., difference in standardized estimates divided by the standardized estimates for the No-DIF model). A difference in standardized coefficients of 0.20 was considered as small, 0.50 as moderate, and 0.80 or greater as large [41]. Estimates of the total effects (i.e., direct and indirect effects) of the covariates on the individual sub-scale items were also produced.
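The relative difference described above is a simple ratio, sketched below in Python. The function names and the example estimates are hypothetical. The labelling function reflects one reading of the 0.20, 0.50, and 0.80 benchmarks, namely that they act as lower bounds on the absolute difference; that interpretation, and the "negligible" label for smaller values, are assumptions rather than something stated in the text.

```python
def relative_difference(est_dif, est_no_dif):
    """(DIF-model estimate - No-DIF-model estimate) / No-DIF-model estimate."""
    return (est_dif - est_no_dif) / est_no_dif


def magnitude_label(difference):
    """Label a difference using 0.20 / 0.50 / 0.80 as lower bounds (assumed)."""
    size = abs(difference)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "moderate"
    if size >= 0.20:
        return "small"
    return "negligible"  # label for values below 0.20 is an assumption


# Made-up standardized estimates for one covariate.
rel = relative_difference(est_dif=0.30, est_no_dif=0.40)
label = magnitude_label(rel)
```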
We used an approach based on dominance analysis [42] and Nagelkerke’s coefficient of determination [43,44,45] to assess the relative importance of both individual items and covariates in the DIF models. Specifically, an item’s importance in the final DIF model was estimated based on its contribution (i.e., direct effects from the covariates to the item) conditional on the contributions of the other items. To measure the item’s importance, a full model was fit to the data that included direct effects of the covariates on all DIF items identified in the previous step, as well as direct effects of the covariates on the latent variable. Next, we fit a series of reduced models that excluded direct effects from the covariates to each DIF item; we did this one item at a time. The importance of each DIF item was assessed using an adaptation of Nagelkerke’s coefficient of determination,
$$ {R}^2=\left(1-{e}^{-\left(\Delta {\chi}^2\right)/N}\right)/\left(1-{e}^{-{\chi}_R^2/N}\right), $$
(4)
where N is the total sample size, \( {\chi}_R^2 \) is the chi-square test statistic for the reduced model, and Δχ2 is the scaled difference in χ2 test statistics for the reduced and full models. The statistic R2 is equal to Nagelkerke’s coefficient of determination if we replace \( {\chi}_R^2 \) with −2 Log(LR) and Δχ2 with −2 Log(LR/LF) in maximum likelihood estimation, where LR and LF are the likelihoods of the reduced and full models, respectively. An item was considered more important than all other items if it had the largest R2 amongst all items.
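Equation (4) is straightforward to evaluate once the test statistics are available. The Python sketch below computes the adapted R2 and ranks items by it; the function name, the sample size, and the χ2 values are invented for illustration, and the scaled difference Δχ2 is assumed to have been obtained from the model comparison itself rather than by simple subtraction.

```python
import math


def adapted_r2(delta_chi2, chi2_reduced, n):
    """Equation (4): (1 - exp(-delta_chi2 / N)) / (1 - exp(-chi2_R / N))."""
    return (1 - math.exp(-delta_chi2 / n)) / (1 - math.exp(-chi2_reduced / n))


# Hypothetical inputs: (scaled delta chi-square, reduced-model chi-square) per item.
n = 1000
stats = {"P1": (20.0, 200.0), "P3": (45.0, 225.0), "P4": (8.0, 188.0)}

r2_by_item = {item: adapted_r2(d, c, n) for item, (d, c) in stats.items()}
most_important = max(r2_by_item, key=r2_by_item.get)
```

With these made-up inputs, the item with the largest scaled difference relative to its reduced-model χ2 receives the largest R2 and would be ranked as most important.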
The importance of a covariate in the final DIF model was measured by its contribution (i.e., direct effects from the covariate to all DIF items), conditional on the contribution of the other model covariates. We used a similar approach to that described above to measure covariate importance in the final DIF model. First, a full model was fit to the data that included direct effects from all covariates to the DIF items, as well as direct effects of the covariates on the latent variable. Next, a series of reduced models was fit to the data that excluded the effect of each covariate; this was done one covariate at a time. Using the adapted Nagelkerke coefficient of determination, we measured the importance of each covariate. A covariate was considered more important than all other covariates if it had the largest R2 amongst all of the covariates.
All analyses were conducted using Mplus software, version 8. In all analyses, the latent factor mean was constrained to zero and its variance was fixed to one.