Property | Definitions and Acceptability Criteria |
---|---|

Targeting | Targeting refers to the extent to which the range of the target construct measured by each of the scales (i.e., perceived health risk and perceived addiction risk) matches the range of that target construct in the study sample. Better targeting equates to a greater ability to interpret the psychometric data with confidence [50]. This involves examination of the relative distributions of the item locations and the person measurements as well as of the plot of the person-item location distributions, showing the item locations and the person measurements on a common scale. There is no specific criterion. Essentially, the item locations should cover the sample adequately and the sample should cover the item locations adequately. |

Fit | The items of the scales of the proposed instrument must work together (fit) as a conformable set, both conceptually and statistically. Otherwise, it is inappropriate to sum item responses to a total score and consider the total scale score as a measure of the target construct. When items do not work together in this way (misfit), the validity of the scale is questionable [50]. The following statistical and graphical indicators of fit were investigated [51]: • Item discrimination: Fit residuals summarize the difference between observed and expected responses to an item across all respondents (item-person interaction). Fit residuals should ideally lie within ±2.5. Fit residuals outside this range imply misfit of the observed data to the Rasch model. Negative values indicate overdiscriminating items and positive values indicate underdiscriminating items. Given the large sample sizes in Surveys 1 and 2, a substantial number of misfitting items was to be expected, but this indicator was still considered helpful because some items were expected to fit much worse than others. • Item fit: Chi-squared values summarize the difference between observed and expected responses to an item for groups (or ‘class intervals’) of individuals with relatively similar levels of ability (item-trait interaction). A chi-squared value with a low probability (p-value) implies that the discrepancy between the observed and expected responses for that item is larger than would be expected by chance. • Item response ordering: This involves the examination of the category probability curves (CPCs) and the threshold probability curves (TPCs), which show the ordering of the thresholds for each item. A threshold marks the location on the latent continuum where two adjacent response categories are equally likely. The ordering of the thresholds should reflect the intended order of the categories, from lower (‘no risk’) to higher (‘high risk’) values. Correct ordering supports the assumption that the response categories work as intended. Disordered thresholds indicate that the response categories for a particular item are not working as intended, and therefore that the scoring function for that item is not valid. • Local independence: This involves an examination of item residual correlations [52]. Correlations between the residuals should be low (< 0.30). In addition, residual correlations were assessed against a relative cut-off, namely the average of all residual correlations plus 0.30 [53, 54]. A residual correlation > 0.30 for an item pair indicates that the response to one item depends on the response to the other item, i.e., the items are locally dependent [55]. |

Reliability | Reliability refers to the extent to which scale scores are free from random error [56]. This was assessed using the person separation index (PSI), which is an internal reliability statistic comparable to Cronbach’s alpha. The PSI quantifies the error associated with the measurements of individuals in the sample [56]. The PSI ranges from 0 (all error) to 1 (no error). A low PSI implies that the scale items are not able to reliably separate individuals on the scale they define. |

Stability | Comparability of PRI measures across different factors was based on tests of invariance (a key criterion of successful measurement). Invariance implies that items mean the same to different participant groups under different conditions. This was assessed by means of a test for differential item functioning (DIF) [57]. Invariance was assessed according to demographic criteria (age, gender, education) as well as across different tobacco and nicotine-containing products, across different subpopulations based on smoking status, and across the application of the scales to perceived personal risk and perceived general risk. DIF is assessed by comparing observed residuals (i.e., the difference between the responses expected under the assumption of no DIF and the responses actually observed) across groups of participants defined by the DIF factor investigated (e.g., males versus females) and classified into several class intervals along the latent continuum measured by the scale. |
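
Two of the criteria in the table can be illustrated numerically. The following is a minimal sketch, not the analysis code used in the study: the function names, the toy residuals and the toy person estimates are all hypothetical. It assumes that standardized item residuals and person estimates with standard errors have already been exported from the Rasch software. The local-dependence check applies the relative cut-off described above (average residual correlation plus 0.3). The PSI is computed in its commonly stated form, as the share of the observed variance of the person estimates that is not error variance; the use of the sample variance here is an assumption.

```python
from itertools import combinations
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def locally_dependent_pairs(residuals, margin=0.3):
    """Flag item pairs whose residual correlation exceeds the relative cut-off.

    residuals: dict mapping item name -> list of residuals (one per person).
    Returns (flagged pairs with their correlations, cut-off used).
    """
    corrs = {
        (a, b): pearson(residuals[a], residuals[b])
        for a, b in combinations(residuals, 2)
    }
    # Relative cut-off: average of all residual correlations plus the margin.
    cutoff = mean(list(corrs.values())) + margin
    flagged = {pair: r for pair, r in corrs.items() if r > cutoff}
    return flagged, cutoff


def person_separation_index(estimates, standard_errors):
    """Share of observed person-estimate variance that is not error variance."""
    observed_var = variance(estimates)  # sample variance (assumption)
    error_var = mean([se ** 2 for se in standard_errors])
    return (observed_var - error_var) / observed_var


if __name__ == "__main__":
    # Toy residuals for four hypothetical items and six persons.
    toy_residuals = {
        "A": [1, -1, 1, -1, 1, -1],
        "B": [1, -1, 1, -1, 1, -1],
        "C": [1, 1, -1, -1, 0, 0],
        "D": [0, 0, 1, 1, -1, -1],
    }
    flagged, cutoff = locally_dependent_pairs(toy_residuals)
    print("cut-off:", round(cutoff, 3), "flagged pairs:", flagged)

    # Toy person estimates (logits) and their standard errors.
    print("PSI:", person_separation_index([-1.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
```

In the toy data, items A and B have identical residuals, so they form the only pair that exceeds the cut-off. The absolute criterion of 0.30 from the table can be checked with the same correlations by comparing each value against 0.30 instead of the relative cut-off.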