A critical review to grading systems and recommendations of traditional Chinese medicine guidelines

Objectives To investigate how many traditional Chinese medicine (TCM) guidelines adopted a grading system and the differences among them, and the distribution of level of evidence used to support TCM recommendations. Methods A comprehensive search of relevant guideline webpages and literature databases were undertaken from inception to August 2018 to identify guidelines including TCM interventions. Two independent reviewers extracted the information about grading systems and recommendations. Results One hundred forty-two TCM guidelines were included, among which, 68 (47.9%) adopted a total of eight grading systems. The definitions, letters, and codes among these systems varied significantly. A total of 1284 recommendations were extracted from included TCM guidelines. More than 60% recommendations were based on a low and very low level of evidence (level C:33.4% and level D: 30.2%). Only 7.8% recommendations were rated as strong recommendation (grade I), while 76.2% recommendations were rated as conditional recommendation (grade II). Conclusions Various grading systems were used in TCM guidelines, which might confuse guideline users. The low proportion of high level of evidence in TCM recommendations might downgrade the confidence to TCM interventions.


Introduction
China is the only country where western medicine and traditional Chinese medicine (TCM) are practiced alongside each other at every level of the healthcare system. and traditional Chinese treatments account for about 40% of the total [1]. Over the past two decades, an increasing number of TCM guidelines have been developed by academic associations and government organizations in China [2,3]. The Chinese guideline clearinghouse (CGC) indexed more than 1 hundred TCM guidelines. At least a thousand recommendations were generated by providing enormous diagnostic and therapeutic options for clinical practitioners.
When guideline panels, especially evidence-based guideline panels decide to recommend an interventional or diagnostic option for guidelines users, they must make judgments about the quality of evidence and strength of recommendations based on the evidence related to the specific context and other factors. The quality of evidence reflects the extent to which confidence in an estimate of the effect is adequate to support a particular recommendation [4,5]. The strength of a recommendation indicates the extent to which one can be confident that adherence to the recommendation will do better than harm [4]. Normally, the higher the quality of evidence, the more likely a strong recommendation can be made [6]. However, several previous surveys indicated that many recommendations in important domains (such as oncology, cardiology, infectious disease, and screening) were based on a low level of evidence [7][8][9], even in the World Health Organization (WHO) guidelines, strong recommendations based on low levels of evidence are frequent [10].
The quality of evidence and strength of recommendations are necessary as they could help to communicate a clear message, so as to help guideline users, readers and stakeholders to understand recommendations quickly and concisely, and more and more international guidelines organizations, such as Scottish Intercollegiate Guidelines Network (SIGN), National comprehensive cancer network (NCCN), National Institute for Health and Care Excellence (NICE) and WHO are applying a grading system to rate the quality of evidence and strength to their recommendations. However, the grading systems adopted by different guideline organizations varied in the definitions, letters, and codes [11], for example, the codes of GRADE system [12], Oxford system [13], SIGN system [14], and NCCN system [15], fall primarily into three categories: letters (e.g., A, B, C, etc.), numbers (e.g., I, II, III, etc.) and mixed letters and numbers (e.g., Ia, Ib, IIa, etc.).
Limited studies have focused on the varied grading systems in guidelines. Holger J, et al. [16] investigated the letters, numbers, symbols and words of grading systems from different guideline organizations in Englishspeaking countries. Andre I, et al. [17] investigated the level of evidence of recommendations in pediatrics guidelines. While no published study was found to investigate grading systems in TCM guidelines. Thus, this study aimed to investigate: 1) how many TCM guidelines adopted a grading system and the differences among them; 2) the distribution of level of evidence and strength of recommendation in TCM guidelines.

Identification of guidelines
Chinese Guideline Clearinghouse and PubMed were searched as an international database, and Wanfang Data Knowledge Service Platform, VIP Online Publishing Platform, China National Knowledge Infrastructure (CNKI) and SinoMed were searched as domestic databases for TCM guideline. Search terms in electronic databases included "guidelines", "statement", "recommendations", "traditional medicine", "traditional Chinese medicine", and "Chinese herbal medicine". Additional file 1 showed the detailed searching strategy. The searches were initially conducted in December 2016 and updated in August 2018 to include more TCM guidelines. Other sources such as Google, Amazon and Dang-dang website were searched for TCM guidelines published in books. Additionally, the reference lists of obtained guidelines were searched to include more potential guidelines.

Selection of guidelines
The including criteria were as follows: (1) complete guideline text is available in English or Chinese, and (2) guideline contains recommendations regarding TCM interventions. If a guideline had updates, only the most recent version would be assessed. The following literature was excluded: duplicate guidelines, guidelines for patients, editorials, translations of guidelines, secondary or multiple publications and short summaries. We also excluded "de novo" or adaptations of external guidelines, as they did not generate their own recommendations.

Data extraction
For each included TCM guideline, two reviewers will independently extract the following information by using a standard form: 1) characteristic information: guideline organizations (i.e. ministry of health, medical doctor association, Chinese medical institute, China Academy of TCM and hospital), year of publication, journal, publication type (e.g. CSCD journal, non-CSCD journal, and book), scope of guideline (prevention and treatment, prevention, diagnosis and treatment, treatment, technology and comprehensive) and funding information, classified into four categories: industry, government, academic association and not reported; 2) grading systems information including: any form of grading systems used for rating the level of evidence and/or strength of recommendations, and the codes, letters, numbers, and symbols of the grading system were also extracted when available; 3) recommendation information: the exact recommendations of each TCM guideline were extracted and standardized into treatment or diagnostic categories. The Consensus was reached by discussing when recommendations were not obvious in guidelines, and disagreements were settled by discussing with a third author (Y.D.L).

Primary measures
The proportion of different levels of evidence and strength of recommendations in TCM guidelines was estimated and compared among subgroups. Considering various grading systems might be used to evaluate the level of evidence and strength of recommendations, a composite grading system was generated to represent all recommendations according to a study published in Chest [18].
The level of evidence was classified into four categories: grade A (high), the evidence used for supporting the recommendation was based on randomized controlled trials without important limitations, or meta-analysis; grade B (moderate), evidence behind the recommendation was based on randomized controlled trials with important limitations, or upgraded observational studies; grade C (low), evidence for recommendation was based on non-randomized studies, cohort or case-control studies, case series; and grade D (very low), the recommendation was made by expert opinion.
The strength of recommendation was adapted into three categories: level I (strong recommendation), recommendation can apply to most patients in many circumstances; level II (conditional recommendation), the best action may differ depending on circumstances or patients' or societal values; and level UG, insufficient evidence on which to formulate a recommendation.

Subgroup analysis
The

Data analysis
Statistic tests of significance might not be necessary as we included the entire cohort of TCM guidelines. Instead, we performed a descriptive analysis using the calculation of the percent distributions of recommendations among quality of evidence and strength of recommendations. Agreement between each reviewer's data extraction was tested by using a two-way ANOVA with single-rater two-way intra-class correlation coefficients (ICCs). According to Landis and Koch [19], the degree of agreement between 0.01 and 0.20 was deemed minor, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 very good. A value of P < 0.05 denoted statistical significance. All tests were twosided. Statistical analyses were conducted using SPSS version 19.0 (SPSS Inc., Chicago, IL, USA).

Search results
We yielded 2380 records from databases, guideline websites and manual searches after excluding duplicated records, of which 166 records were considered to be potentially relevant; after selection, a total of 142 TCM guidelines were satisfied the inclusion criteria (Fig. 1). Table 1 showed the characteristics of 142 TCM guidelines. 69.8% (n = 99) TCM guidelines published in 2008-2012 (Fig. 2), 50.0% (n = 71) TCM guidelines developed by Chinese medical institute and 197% (n = 28) developed by China Academy of TCM. 86.6% (n = 123) TCM guidelines focused on diagnosis and treatment. In terms of funding, 53.5% (n = 76) TCM guidelines were supported by the government, and 19.7% (n = 28) funded by medical association, however, none TCM guidelines reported the conflicts of interest statement. The ICC value for data extraction process was 0.87, 95% confidence interval 0.83 to 0.90, indicating a good agreement between reviewers.

Evidence grading systems
In the 142 TCM guidelines, 52.1% did not adopt any grading system, eight grading systems were found in the rest 47.9% TCM guidelines. Table 2 showed the codes of quality of evidence and strength of recommendation for reported grading systems. The most often used system was TCM grading system developed by Liu Jianping (Director of evidence-based center of Beijing University of Chinese medicine) team [20], which was used in 53 TCM guidelines, followed by GRADE system [12] [22], and David Sackett system [23] were used by only one TCM guideline respectively.

Level of evidence and strength of recommendation
A total of 1284 recommendations were extracted from 142 included guidelines. Of recommendations with levels of evidence, more than 60% based on level C and D evidence (33.4 and 30.2%, respectively) ( Table 3). Among recommendations with a strength of recommendation, only 7.8% were rated as strong (level I), 76.2% as weak (level II), and 16.0% as level UG. Table 1 The characteristics of included studies

Discussion
Our results showed that eight evidence grading systems were used in 47.8% TCM guidelines, while the rest 52.1% did not sue any evidence grading system. For the eight evidence grading systems, the different codes, letters, and level of evidence are likely to lead to confuses and even misunderstanding for guideline users when using TCM guidelines.
The definitions to the eight grading systems also varied a lot, for example, the GRADE system graded the quality of evidence according to bias which might decrease the confidence of included studies, that is to say, even RCTs might be at low quality of evidence if they were at high risk of bias [12]. On the other hand, the other systems such as NEEBGDP and Oxford system, they classified the level of evidence according to study design, which led to that RCTs would always be graded  as high quality of evidence whether they were at high or low risk of bias. Similar situations happened in the strength of recommendations, it did not classify expert opinions as a level of evidence in GRADE system, but it was a level in the TCM grading system and NEEBGDP system. Besides, the codes from the eight grading systems were very different (Table 1), for example, the "A, B, C, D" codes were used to reflect the quality of evidence in the GRADE system, and the adapted David Sackett system, but they were also used to indicate the strength of recommendation in the TCM grading system, SIGN system, adapted NEEBGDP system, and AHRQ system. We could imply that fresh health care professionals, especially medical students, perhaps, are easily puzzled by the information of different grading systems conveyed. Thus, various grading systems may not fulfill their original intention. Indeed, if the same code, used by different systems, represents different meanings, may result in bewilderment and incomprehension.
Considering different grading systems would generate different results of level of evidence and strength of recommendation, which could contribute to challenges of implementing of TCM guidelines, therefore a standardized and well-recognized grading system is needed for TCM guidelines. A study evaluated the effect of presenting a recommendation in a clinical practice guideline using different grading systems to determine to what extent the system used changes the clinician's eventual response to a particular clinical question, they found the clinician's decision to use a therapy was influenced most by the GRADE system [24]. It would be valuable to conduct a similar study in TCM guidelines to determine which grading system influence most to TCM clinicians.
Some international guideline organizations such as SIGN and AHRQ had realized the importance of reaching consensus to standardized grading system and began to adopt the GRADE system in their new guidelines. The GRADE system was developed by GRADE working group in 2000, which has been adopted by more than 100 national and international organizations [25]. Similarly, the Cochrane Collaboration now requires authors to use GRADE for all important outcomes in their systematic reviews. Hence, that might be an option for TCM guidelines to use or adapt GARDE system in the future. In addition, some Chinese researchers argued external grading systems such as GARDE and oxford might not be suitable for TCM guidelines when considering the big gaps between TCM and western medicine, for example, the construction of TCM guidelines follow some principles from ancient TCM books, some TCM treatments are only used in Asian counties, hence, they generated a new grading system for TCM guideline, i.e., the TCM grading system reported in this study. Now the TCM grading system was updated in 2019 to make it more approachable and understandable for users [26].
Another important issue is that we found more than 50% of TCM guidelines did not adopt any grading system. This could limit the application of TCM guidelines in practice. It is almost a consensus among international guideline organizations that a unique systematic approach to grade evidence and strength of recommendations can minimize bias, aid interpretation and promote the implementation of guidelines [4], as grading systems could promote guidelines delivering a clear message so as to help guideline readers understand recommendations quickly and concisely. According to a survey, there are more than 60 grading systems with extremely wide variations to quality of evidence and strength of recommendations [27], so it still has a long way to reach a consensus of using a standard system in different guideline organizations. Most TCM guideline developers didn't seem to realize the advantages of using grading systems. Good news is that one guideline handbook for integrative medicine was published in 2016 [28] and using the GRADE system to rate the quality of evidence and strength of recommendations was encouraged, which will promote the use of grading systems in the coming TCM guidelines.
Normally, the higher the quality of evidence, the more likely a strong recommendation can be made [4]. However, in this study, only 7.8% recommendations, based on a high level of evidence (grade I, meta-analysis and RCT), which would downgrade our confidence to the TCM interventions in some degree. According to Deng et al's study [29], the low proportion of meta-analysis and RCT referred by TCM recommendations was associated with insufficient search of electronic databases, and poor rigor of development process. We should also notice with the quality of meta-analysis and RCTs referred by TCM guidelines, Yao et al. [30] study showed that the quality of TCM meta-analyses published in 2010-2012 was significantly lower than non-TCM metaanalyses. And Wu et al. [31] found that more than 90% of RCTs published in core Chinese journals lacked an adequate description of randomization. They found most trials, despite being claimed to be RCTs, did not fulfill the criteria of a real RCT, which suggested a lack of adequate understanding of rigorous clinical trial design and conducting among the investigators.
The conflicts of interest in guidelines, which defines as any interest held by an expert that may affect or reasonably be perceived to affect the expert's objectivity and independence in providing advice, can influence the level of evidence and strength of recommendations [32]. WHO divides the conflicts of interest into three categories: financial, academic and public positions [33]. A financial relationship could impact an individual's ability to approach a scientific question with an open mind, and academic activities could create the potential for an attachment to a specific point of view that could unduly affect an individual's judgment about a specific recommendation [33]. Although 73.2% of TCM guidelines reported the funding source, none of them reported the conflicts of interest. We might exclude the possibility of financial conflicts of interest in TCM guidelines as none of them reported funding from pharmaceutical companies, however academic and public conflicts of interest remain unclear. More importantly, there are still 26.8% TCM guidelines did not report any information about funding or conflicts of interest, the insufficient transparency to funding and conflicts of interest statement would downgrade the confidence to TCM interventions.
Subgroup analyses indicated that the proportion of the high level of evidence from TCM guidelines published in CSCD journals was not better than in non-CSCD journals. This might indicate that the peer review process in CSCD journals is probably not more rigorous than in non-CSCD journals. This is surprising because CSCD journals are required to meet more criteria, compared to those in non-CSCD journals.
Addressing the limitations of existing TCM guidelines, a standardized grading system developed by multidisciplinary experts is needed. Using the same grading system and codes will make TCM guidelines more understandable. Moreover, informing TCM recommendations underlying the evidence-based approach, as well as using systematic reviews to support recommendations will make TCM guidelines more trustworthy.

Strengths and limitations
Our overall findings have some strengths. Firstly, this study is the first to examine the evidence type and grading systems of recommendations in TCM guidelines. Secondary, each grading system has been appropriately compared according to their codes and definitions, which provided clear information about the differences among them. Thirdly, and subgroups analyses were conducted to detect the variation of different groups among level of evidence and strength of recommendation. Nonetheless, our study has several limitations. We included TCM guidelines in journals and books, but we may have missed guidelines published in other forms such as web page, which might understate the performance of TCM guidelines. Second, our study could not establish the causality between the poor performance and the characteristics of TCM guidelines. Besides, the evidence level of each recommendation was appropriately re-classified according to evidence type by evaluators, which might lead to some bias to interpret the quality of TCM guidelines.

Conclusions
Various grading systems were used in TCM guidelines, which might confuse guideline users. The low proportion of high level of evidence in recommendations could downgrade the confidence to TCM interventions. A standardized grading system should be established in TCM guidelines. And more high level of evidence should be used to enhance the confidence of recommendations and promote the dissemination and implementation of TCM guidelines.
Additional file 1. Search strategies for TCM guidelines.