Discussion
This study aimed to understand how WHO has responded to the reforms made to its guideline development process more than 10 years ago, including the major changes introduced in 2007, the progress and impact of these changes to date, and the key challenges that remain to be addressed. Informed by semistructured interviews with WHO staff and by previous studies of WHO's guideline development process, we propose and discuss two recommendations that can inform efforts to improve guideline development in technical health agencies such as WHO and others at the local, national and international levels.
Recommendation 1: Guideline development processes in technical health agencies and institutions should learn from WHO’s vast experience with implementing independent evaluation, monitoring and feedback for process and quality to ensure the legitimacy of recommendations
One major factor contributing to the legitimacy of a health recommendation is the underlying evidence base, or in other words, the extent to which recommendations are consistent with the quality of available research evidence. Our study did not quantitatively address this question, but other studies suggest that there is still room for WHO to improve. For example, one study found that over 50% of strong WHO recommendations are based on assessments of evidence that place low or very low confidence in effect estimates (known as ‘discordant recommendations’), with the majority of these being inconsistent with the GRADE approach.6 12
However, the strength of the evidence alone is insufficient to ensure the legitimacy of a recommendation. A second major contributing factor to legitimacy is the decision-making process through which the recommendations are developed.30 Good process is particularly important in cases where recommendations might implicate commercial or ideological interests. A strategy that interested actors often use to challenge the legitimacy of recommendations, even in cases where the underlying evidence base is strong, is to question the process by which guidelines were generated, including the selection of experts or the participation of relevant stakeholders.
One way of strengthening the legitimacy of decision-making processes and protecting recommendations from undue criticism is to ensure that the process followed is subject to independent evaluation, monitoring and feedback for quality. WHO has more than 10 years of accumulated experience with implementing its GRC mechanism, which can be seen as an internal quality assurance mechanism augmented by external expertise. Its assessments are made independently of WHO's senior management and its Member States. Accordingly, the GRC can be seen to serve two roles: (1) it represents an institutional mechanism for independent evaluation that can strengthen the legitimacy of the decision-making process underlying recommendations, and (2) by involving staff internal to the agency, it supports the gradual institutionalisation of evidence-informed principles and processes. Learning from WHO's experience with implementing independent evaluation, monitoring and feedback for process and quality can therefore be relevant for other technical agencies and institutions responsible for developing guidelines on health issues.
Recommendation 2: Guideline development processes at WHO should be designed to better acquire, assess, adapt and apply the full range of research evidence that can inform recommendations about health systems and public health
The second recommendation calls for adapting WHO's guideline development process to better enable assessment of the evidence base needed to inform health systems and public health interventions, many, if not all, of which are 'complex interventions'.31 Unlike individual-level interventions, which can more easily be evaluated by randomising study participants to receive either treatment or control, many health systems and public health interventions cannot be randomised for ethical, legal or logistical reasons, even when governing decision-makers are supportive of doing so. Accordingly, evidence from non-randomised study designs (for example, a well-designed observational study, a quasi-experimental impact evaluation or systematically documented evidence from programme experience) may represent the highest quality of evidence one can expect for a public health or health systems intervention.9 32 This challenge is not unique to guideline development processes at WHO and has been debated repeatedly over the years.14 33–40 It is an important factor explaining why WHO interviewees raised concerns that the guideline development process is not flexible enough to incorporate and appropriately evaluate evidence from non-randomised study designs and qualitative studies. Similar views have previously been reported among WHO staff9 and guideline panel members,11 41 and confirmed by methodologists with experience serving on guideline panels.15
Two design features of WHO's guideline development process seem particularly in need of adaptation if the full evidence base needed to inform recommendations is to be incorporated. The first is the use of systematic reviews to critically appraise the research underpinning WHO recommendations. Systematic reviews are among the cornerstones of guideline development and are critical for reducing the risk of bias and reaching reliable evidence-informed conclusions. They are typically conducted by formulating a clear and specific question, most often in the PICO format (population, intervention, comparison, outcome), and by applying explicit inclusion/exclusion criteria. This rigorous approach helps to identify and include studies that are comparable and able to answer questions about the effectiveness of interventions (what works).36 It works well for evaluating the effectiveness of a medical treatment. However, many health systems and public health interventions are 'complex' interventions: they have multiple interacting components, require the involvement of different organisational levels, have numerous points of interaction between the intervention and the settings in which it is implemented, and affect a range of outcomes.31 37 38 For complex interventions, framing systematic review questions too narrowly and relying solely on evidence of effectiveness risks excluding the broader range of relevant evidence needed to inform recommendations. This includes evidence on factors important for implementing an intervention, bringing it to scale, assessing the resources needed to implement it across different settings, understanding its feasibility and acceptability, identifying the interactions among its components, and probing the systems in which it is implemented.42 Such factors are not easily identifiable if systematic review questions are narrow and solely include experimental intervention studies focused on safety and effectiveness. Currently, the WHO Handbook for Guideline Development does not offer comprehensive guidance on adapting the PICO format or considering alternative frameworks for systematic reviews of health systems and public health interventions.14 In light of new tools and frameworks for conducting systematic reviews of complex interventions,43 WHO might consider adapting its guidance.
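To make this contrast concrete, the sketch below (a minimal illustration in Python; the field names and example questions are our own and are not drawn from the WHO Handbook or any specific guideline) represents a classic PICO question alongside a question broadened with the extra dimensions a complex intervention may require.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReviewQuestion:
    """A systematic review question in PICO form, optionally extended
    with dimensions relevant to complex health systems interventions."""
    population: str
    intervention: str
    comparison: str
    outcome: str
    # Extended dimensions (not part of classic PICO):
    context: Optional[str] = None       # system/setting the intervention runs in
    implementation: List[str] = field(default_factory=list)  # eg, scale-up, resources
    acceptability: Optional[str] = None # stakeholder views, often from qualitative studies

# A narrow, classic PICO question, well suited to a single medical treatment:
narrow = ReviewQuestion(
    population="adults with hypertension",
    intervention="drug A",
    comparison="placebo",
    outcome="systolic blood pressure at 12 weeks",
)

# A broadened question for a complex health systems intervention:
broad = ReviewQuestion(
    population="pregnant women in rural districts",
    intervention="shifting antenatal care tasks to lay health workers",
    comparison="standard facility-based care",
    outcome="antenatal care coverage",
    context="low-resource health systems with workforce shortages",
    implementation=["training requirements", "supervision structures", "scale-up costs"],
    acceptability="views of women and health workers on lay-worker-led care",
)
```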
The second design feature of WHO's guideline development process that has led to it being viewed as inflexible involves the approaches and tools used to evaluate the quality of the evidence base. At WHO, the GRADE approach has been the main tool for assessing the quality of the evidence underlying recommendations. This study's findings highlight three aspects associated with GRADE that seem to have reinforced the view among many WHO staff that guideline development processes are not designed to incorporate a broader evidence base. The first is an insufficient understanding among many people involved in guideline development processes of the purpose, utility and implementation of the GRADE approach. This was also highlighted by Sinclair et al in their evaluation of WHO's guideline development process,9 suggesting that better understanding of GRADE among WHO staff involved with guideline development needs continued attention. To this end, guidance to promote a more sophisticated understanding of GRADE has been issued (including by the GRADE Working Group),40 but greater awareness and wider implementation are needed at WHO. The second aspect is GRADE's initial rating of certainty in evidence from non-randomised study designs as 'low'. This is a feature of GRADE that guideline developers beyond WHO have raised concerns about, since in fields where RCTs are sparse or not feasible, the quality of the evidence will rarely be rated as 'high' or 'moderate'13; fortunately, this is a criticism that the GRADE Working Group has noted and is seeking to address.44 The third aspect is that GRADE was not designed to evaluate the quality of evidence from qualitative studies, a type of evidence increasingly recognised as crucial for informing decision-makers about the needs, values, perceptions and experiences of stakeholders important for an intervention, and about the system-level factors affecting implementation.41 45 The reliance, at least until very recently, on GRADE as the sole tool for assessing the body of evidence may have created a perception that guideline development processes are not intended to incorporate a broader evidence base, including qualitative evidence. These concerns over GRADE raised by WHO staff in our study align with challenges highlighted by others.14 32 46 47
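As an illustration of why this initial rating matters, the following deliberately simplified sketch (our own, not an official GRADE algorithm; real GRADE assessments are structured judgements per domain, not mechanical scores) captures the classic start-high/start-low logic and the up- and downgrading moves described above.

```python
# Simplified sketch of GRADE's certainty-rating logic, illustrating why
# evidence from non-randomised designs rarely reaches 'high' or 'moderate'
# under the classic approach.

LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(randomised: bool, downgrades: int, upgrades: int = 0) -> str:
    """Classic GRADE: RCT evidence starts 'high'; non-randomised evidence
    starts 'low'. Certainty moves down for concerns (risk of bias,
    inconsistency, indirectness, imprecision, publication bias) and, for
    non-randomised evidence, may move up (large effect, dose-response,
    plausible residual confounding working against the observed effect)."""
    start = LEVELS.index("high") if randomised else LEVELS.index("low")
    final = max(0, min(len(LEVELS) - 1, start - downgrades + upgrades))
    return LEVELS[final]

# A body of RCTs with one serious concern (eg, imprecision):
print(grade_certainty(randomised=True, downgrades=1))   # -> 'moderate'

# A well-conducted observational study with no concerns still starts 'low':
print(grade_certainty(randomised=False, downgrades=0))  # -> 'low'
```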
On all of these fronts, there have been recent developments worth noting. For evaluating non-randomised study designs, the ROBINS-I tool (Risk Of Bias In Non-randomized Studies of Interventions) has been developed, with further extensions planned.48 Using ROBINS-I as part of GRADE assessments can enable better comparisons of evidence from non-randomised study designs and RCTs, as well as more detailed assessments of different types of non-randomised study designs (such as rating evidence from a well-designed interrupted time series study higher than evidence from conventional non-randomised study designs), thereby addressing one of the major concerns raised by WHO interviewees.
For evaluating the quality of qualitative evidence, WHO, together with collaborators, has taken a leadership role. Its own guideline development process for recommendations on optimising health worker roles for maternal and newborn health expanded the evidence base beyond safety and effectiveness,41 leading to the development of a new approach for assessing the confidence that can be placed in qualitative evidence: the Grading of Recommendations, Assessment, Development and Evaluation-Confidence in Evidence from Reviews of Qualitative Research (GRADE-CERQual) tool.49 GRADE-CERQual has since been further developed and implemented as part of WHO guideline development processes,45 addressing another concern raised by WHO interviewees. Moreover, the Evidence-to-Decision Framework developed by the DECIDE project creates space for assessing an intervention's acceptability and feasibility and is increasingly being used in WHO's guideline development processes.50–52 Finally, the challenges of synthesising and assessing the quality of evidence for complex interventions, and the need for guidance and tools, are increasingly recognised, both within and beyond WHO.43 53
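For orientation, the sketch below summarises the four GRADE-CERQual components as a simple data structure. The aggregation rule and the example judgements are our own simplifications; CERQual itself relies on structured narrative judgement rather than scoring.

```python
# The four GRADE-CERQual components for assessing confidence in a finding
# from a qualitative evidence synthesis.
CERQUAL_COMPONENTS = {
    "methodological_limitations": "concerns about the design/conduct of contributing studies",
    "coherence": "how well the finding is grounded in the underlying data",
    "adequacy": "richness and quantity of data supporting the finding",
    "relevance": "applicability of the data to the review question's context",
}

def overall_confidence(concerns: dict) -> str:
    """Start at 'high' confidence and lower it for each component with
    serious concerns (a simplification of the CERQual guidance)."""
    order = ["high", "moderate", "low", "very low"]
    serious = sum(1 for level in concerns.values() if level == "serious")
    return order[min(serious, len(order) - 1)]

# Hypothetical assessment of one review finding:
concerns = {
    "methodological_limitations": "minor",
    "coherence": "minor",
    "adequacy": "serious",  # thin data from only two small studies
    "relevance": "minor",
}
print(overall_confidence(concerns))  # -> 'moderate'
```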
Specific actions for WHO
We identify at least three specific areas where action could be taken by WHO (box 2). First, there should be more frequent and systematic sharing of experiences among WHO departments and between the GRC and the various departments that develop guidelines. Such sharing, together with continuous professional development for WHO staff, would help address many of the issues raised by this study and others.9 11 15 Second, the guideline development process should be further enhanced to meet the needs of health systems and public health interventions, which is consistent with recent calls in peer-reviewed journals from senior WHO staff.54 It is therefore timely that efforts are underway to examine extensions to the GRADE approach,55 as well as efforts led by the GRADE Working Group to integrate GRADE assessments with the use of tools such as ROBINS-I.44 Moreover, WHO has internally recognised this and other challenges with its guideline development process56 and initiated its own process for improving the retrieval, synthesis and assessment of evidence on complex health interventions, which might inform future changes to the design of WHO's guideline development process.53
Box 2: Two general recommendations and three specific actions for WHO
Two general recommendations
Guideline development processes in technical health agencies and institutions should learn from WHO’s vast experience with implementing independent evaluation, monitoring and feedback for process and quality to ensure the legitimacy of recommendations.
Guideline development processes at WHO should be designed to better acquire, assess, adapt and apply the full range of research evidence that can inform recommendations about health systems and public health.
Three specific actions for WHO
WHO should foster the systematic sharing of experiences and learning among its departments that are or are planning to engage with guideline development processes so as to promote continuous professional development of its staff.
WHO should share its experience externally (such as with the GRADE (Grading of Recommendations Assessment, Development and Evaluation) Working Group and its subgroups) as part of an effort to further optimise the guideline development processes to meet the needs of health systems and public health interventions (eg, complex interventions).
WHO should consider whether outputs from scientific advisory committees that currently operate outside of the formal guideline development rules should be subject to a centralised quality assurance process.
Finally, WHO should consider whether all products containing advice and guidance that emerge from WHO's many technical departments and scientific advisory committees (many of which currently operate outside of the GRC's mandate) could benefit from a centralised quality assurance process independent of WHO Member States, similar to what the GRC currently performs for WHO's formal guidelines. This may improve quality and legitimacy, but it will also require resources, time and planning. On this front, a recent development is that WHO has proposed in its draft 13th General Programme of Work to 'establish guiding principles and quality assurance procedures for the design, formulation and dissemination/follow-up of all normative products (all normative products, including strategies, road maps and global action plans will be based on agreed standards and reviewed independently, as is the case for technical guidelines), including maximizing the use and engagement of top international experts'57, a proposal informed by a 2017 review of WHO's normative functions.58
Strengths, limitations and reflections on study design and data analysis
We identify three main strengths and two main limitations of this study. The first strength is the large number of interviewees (n=16) with experience of WHO's guideline development process, complemented by additional interviewees (n=19) working with other structures that produce scientific advice (eg, expert committees, scientific and technical advisory groups), which enabled us to consider WHO's broader context when interpreting our findings. A second strength is that the majority of interviewees were senior WHO staff who had worked for the agency since before the guideline development reforms were initiated and could therefore draw on their experiences from before and after the reforms. The third strength is the diversity of technical areas represented by the interviewees, which enabled us to identify themes relevant to guideline development processes across WHO's technical areas. The invited WHO staff who, for various reasons, declined to participate did not differ with respect to roles and technical areas from those who were interviewed, since we managed to recruit interviewees from various levels and across many technical areas of the agency. Overall, our analysis was informed by a large amount of qualitative data comprising diverse sets of relevant experiences accumulated over a long period.
The first main limitation is that the study was initially conceived to examine the design features of WHO’s scientific advisory committees in general, and not specifically to evaluate WHO’s guideline development reforms. We may therefore have overlooked asking important questions that could have deepened insights about WHO’s experience with its reformed guideline development process. For example, while all interviewees emphasised the importance of diverse representation in guideline development groups, we did not probe in detail their experience with involving populations affected by the recommendations during guideline development—which is another important factor affecting the legitimacy of recommendations. The second main limitation is social desirability bias—that WHO interviewees may have responded in a way that casts the agency in a favourable light, downplaying internal weaknesses and challenges. However, the majority of interviewees provided candid assessments of the agency’s progress and challenges with guideline development and producing scientific advice. Moreover, some also spoke rather critically of the agency, such that we do not believe the results are unduly biased in this way.
Finally, choices made during data analysis are worth discussing in light of differing views and traditions on improving reliability in qualitative research. In our study, both investigators discussed and reached agreement about the identified themes and the fit of the coding with these themes. However, we did not implement independent coding by two investigators; rather, one investigator undertook the initial coding and identification of preliminary themes, which were subsequently discussed and refined in dialogue with the second investigator. This strategy may be seen as a weakness by researchers who argue that multiple, independent coding and calculation of inter-rater reliability are prerequisites for rigour and trustworthiness in qualitative research.59–61 However, our approach is in line with strategies undertaken and advocated by many qualitative researchers, including Braun and Clarke, who have developed an approach to thematic analysis that closely resembles the analytical strategy undertaken in this study.18 62 They and others argue that there is no single accurate way of coding and interpreting qualitative data, and that it is unrealistic to expect different researchers to reach exactly the same insights from qualitative data, since they may differ in disciplinary backgrounds and theoretical starting points. What remains important is full transparency about the choices made during data analysis, so that others can evaluate how these choices may have affected analysis and interpretation, together with efforts to minimise the risk of misrepresenting the qualitative data. To address the former, we have described our approach to data collection and analysis in detail in the Methods section. To address the latter, we implemented participant checking. Although fewer than half of the interviewees responded to our queries, only one interviewee raised objections to the way the findings were presented. We assume, but cannot be completely certain, that if other interviewees had had similar objections, they would have expressed these after receiving the interview summaries or the manuscript. Moreover, two WHO officials who were not interviewed for this study, but who have extensive in-depth experience with WHO's guideline development process, reviewed the manuscript and reported that they recognised the experiences and key challenges identified by our study. Overall, we believe that the reported findings and interpretation do not misrepresent the interview data, but we accept that these findings could be interpreted differently by other researchers. We therefore invite continued debate on the issues raised by this study.
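For readers unfamiliar with the inter-rater reliability statistics referred to above, the following sketch computes Cohen's kappa for two hypothetical coders. The codes and excerpts are invented for illustration; as noted, our study relied on dialogue-based consensus rather than independent double-coding.

```python
# Cohen's kappa: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
# agreement between two coders and p_e is the agreement expected by chance
# given each coder's marginal code frequencies.
from collections import Counter

def cohens_kappa(coder_a: list, coder_b: list) -> float:
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(coder_a) | set(coder_b))
    return (p_o - p_e) / (1 - p_e)

# Two coders assigning themes to ten interview excerpts:
a = ["process", "evidence", "process", "legitimacy", "process",
     "evidence", "legitimacy", "process", "evidence", "process"]
b = ["process", "evidence", "evidence", "legitimacy", "process",
     "evidence", "legitimacy", "process", "process", "process"]
print(round(cohens_kappa(a, b), 2))  # -> 0.68, agreement on 8/10 excerpts
```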