Orofacial esthetics (OE), or orofacial appearance, is a dimension of oral health-related quality of life [12], a comprehensive and important concept that characterizes how individuals perceive their oral health. OE can be measured with the Orofacial Esthetics Scale (OES). The scale was originally developed in prosthodontic patients; the present study extends its use to the general population. For the adult general population, we provide evidence for the reliability and validity of OES scores that characterize the construct OE with a single summary score.

### Limitations and interpretation of dimensionality findings

Our dimensionality findings do not agree completely with each other. Visual inspection of the correlation matrix (“intuitive factor analysis,” according to Gorsuch [15]) favored unidimensionality of the OES. EFA also supported unidimensionality according to several criteria. The CFA findings were less straightforward: the chi-square test rejected the hypothesis of a unidimensional model, and model fit was acceptable for only one of the five selected fit indices.

How can this discrepancy between EFA and CFA be explained when, conceptually, the two methods should lead to the same conclusions? The two methods differ in their criteria for adequate model fit. For EFA, the substantial first latent factor and the large eigenvalue differences between the first and subsequent latent factors (Kaiser criterion, scree plot) were sufficient to view OE as unidimensional. CFA applies different criteria. The chi-square test rejected unidimensionality, which is not surprising because this test is sensitive to sample size. For samples with more than 400 subjects (we analyzed 579 and 580 subjects in the two data sets), the chi-square statistic is almost always statistically significant [16]. A different picture emerged for the SRMR, the only fit index we used that does not include the chi-square value. Conceptually, the SRMR represents the average discrepancy between the correlations observed in the sample correlation matrix and the model-predicted correlations. The SRMR was between 0.03 and 0.06 for all models. In our opinion, this is small in both absolute and relative magnitude (taking the average inter-item correlation of 0.66 into account); that is, on average, discrepancies between observed and predicted correlations were reasonable. In addition, individual residuals were by and large acceptable. Assessing individual residuals to detect “localized areas of strain” is commonly recommended [17]. It has also been recommended that fit indices not be computed at all for models with few degrees of freedom (such as ours), and that the source of specification error be identified instead [16]. We followed that recommendation and found that only two of the 21 fitted residual correlations were larger than 0.10, a rule-of-thumb threshold for adequate fit in the SEM literature [6].
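To make the SRMR and the residual check concrete, the following is a minimal sketch in Python, not the code used for our analyses. It assumes one common definition of the SRMR for correlation matrices (root mean square of the residuals over the p(p+1)/2 unique elements, diagonal included); software packages differ in such details. The loadings and the observed matrix are invented for illustration and are not OES data, and the function names are hypothetical.

```python
import numpy as np


def srmr(observed, implied, include_diagonal=True):
    """Root mean square of residual correlations over the unique matrix elements.

    Assumption: both inputs are correlation matrices; with include_diagonal=True
    the p(p+1)/2 elements of the lower triangle (diagonal included) are used.
    """
    observed = np.asarray(observed, dtype=float)
    implied = np.asarray(implied, dtype=float)
    p = observed.shape[0]
    k = 0 if include_diagonal else -1
    rows, cols = np.tril_indices(p, k=k)
    residuals = observed[rows, cols] - implied[rows, cols]
    return float(np.sqrt(np.mean(residuals ** 2)))


def count_large_residuals(observed, implied, threshold=0.10):
    """Count off-diagonal residual correlations whose absolute value exceeds threshold."""
    observed = np.asarray(observed, dtype=float)
    implied = np.asarray(implied, dtype=float)
    p = observed.shape[0]
    rows, cols = np.tril_indices(p, k=-1)
    residuals = np.abs(observed[rows, cols] - implied[rows, cols])
    return int(np.sum(residuals > threshold)), int(residuals.size)


# Hypothetical one-factor model for 7 items (invented loadings, not OES estimates).
loadings = np.array([0.80, 0.85, 0.80, 0.85, 0.80, 0.85, 0.80])
implied = np.outer(loadings, loadings)  # model-implied correlations
np.fill_diagonal(implied, 1.0)

# Invented "observed" matrix: identical to the implied one except that two items
# share extra correlation the single factor does not explain (a local strain).
observed = implied.copy()
observed[0, 1] = observed[1, 0] = implied[0, 1] + 0.15

srmr_value = srmr(observed, implied)
n_large, n_total = count_large_residuals(observed, implied)
print(f"SRMR = {srmr_value:.3f}; residuals > 0.10: {n_large} of {n_total}")
```

In this toy example a single poorly reproduced correlation leaves the SRMR small while the residual count points directly to the item pair responsible, which is why inspecting individual residuals complements a summary index.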

That CFA fails to confirm EFA results has been observed before [18, 19], and it has been pointed out that the two techniques are not fully comparable [20], e.g., in their criteria for evaluating models, as discussed above. In our data, findings differed only slightly across methods. The strong latent factor was sufficient for EFA to view OE as unidimensional, whereas CFA treated the items *face* and *profile* as indicators of a second factor worth identifying to improve model fit. However, statistical significance is different from clinical relevance, and the last step of a CFA, the consideration of equivalent models, provides interesting insight into the construct OE. Equivalent models have identical goodness of fit but different substantive interpretations [21]. Among several equivalent models, we considered a two-factor model (model C, Figure 2) and a hierarchical model (model D, Figure 2) as important alternatives. This two-factor model differs from the 35 two-factor models we investigated in the first data set. It has only two items for the second latent factor, which is the minimum for identification [6], whereas the earlier models used three indicators for more robust factor identification, as recommended [22]. The interpretation of this model, and of the hierarchical model, which merely adds a second-order factor summarizing the OE construct, is that OE may have an extraoral and an intraoral component. This seems plausible. Facial (i.e., extraoral) and dental (i.e., intraoral) esthetics are well-known terms in dentistry that represent these concepts. For example, facial and dental appearance were distinguished in patients with bilateral cleft lip and palate [23]. Another study showed that esthetic dental and facial measurements were important factors for patient satisfaction and should be considered in esthetic anterior oral rehabilitation [24].

Taking together all factor-analytic results and the reliability and validity findings, we recommend a simple characterization of the construct OE with one summary score. We have not investigated other types of validity and reliability that could also be informative about the dimensionality of the OES. At present, however, we do not consider the possible distinction between intraoral and extraoral esthetics worth describing with two separate scores. Future studies should explore this further.