Making sense of the evidence in population health intervention research: building a dry stone wall
David Ogilvie (1), Adrian Bauman (2), Louise Foley (1), Cornelia Guell (3), David Humphreys (4), Jenna Panter (1)

  1. MRC Epidemiology Unit, University of Cambridge, Cambridge, UK
  2. School of Public Health, The University of Sydney, Sydney, New South Wales, Australia
  3. European Centre for Environment and Human Health, University of Exeter, Truro, UK
  4. Department of Social Policy and Innovation, University of Oxford, Oxford, UK

Correspondence to Dr David Ogilvie; david.ogilvie{at}


To effectively tackle population health challenges, we must address the fundamental determinants of behaviour and health. Among other things, this will entail devoting more attention to the evaluation of upstream intervention strategies. However, merely increasing the supply of such studies is not enough. The pivotal link between research and policy or practice should be the cumulation of insight from multiple studies. If conventional evidence synthesis can be thought of as analogous to building a wall, then we can increase the supply of bricks (the number of studies), their similarity (statistical commensurability) or the strength of the mortar (the statistical methods for holding them together). However, many contemporary public health challenges seem akin to herding sheep in mountainous terrain, where ordinary walls are of limited use and a more flexible way of combining dissimilar stones (pieces of evidence) may be required. This would entail shifting towards generalising the functions of interventions, rather than their effects; towards inference to the best explanation, rather than relying on binary hypothesis-testing; and towards embracing divergent findings, to be resolved by testing theories across a cumulated body of work. In this way we might channel a spirit of pragmatic pluralism into making sense of complex sets of evidence, robust enough to support more plausible causal inference to guide action, while accepting and adapting to the reality of the public health landscape rather than wishing it were otherwise. The traditional art of dry stone walling can serve as a metaphor for the more ‘holistic sense-making’ we propose.

  • prevention strategies
  • public health
  • intervention study
  • systematic review

This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See:

Summary box

  • Systematic reviews and guidance development groups frequently conclude that the available evidence about the effects of population health interventions is too diverse, flawed or inconclusive to support a more general conclusion about what should be done.

  • In spite of all the developments in quantitative methods for primary research and evidence synthesis, we struggle to derive meaningful generalisable inferences from the evaluation of interventions in arenas such as the food, transport or welfare systems to guide and support public health action.

  • We respond to a long-standing call for more ‘holistic sense-making’ in this arena by proposing a more eclectic, flexible and reflexive approach to building and interpreting the evidence.

  • We show how a spirit of pragmatic pluralism might be channelled into constructing ‘dry stone walls’ of evidence, robust enough to support more plausible causal inference to guide action, while accepting and adapting to the reality of the public health landscape rather than wishing it were otherwise.

  • We should look beyond simple notions of ‘interventions’, search for patterns and embrace the mess in evidence synthesis in order to better understand what makes for an effective public health strategy.


Effectively tackling population and planetary health challenges such as climate change or diabetes requires us to address the fundamental, upstream determinants of behaviour and health in populations.1 This may sometimes entail contentious policies such as diverting funds from other priorities or constraining people’s freedoms, which ought to be guided by the best available scientific evidence. To this end, it is increasingly accepted that we should advocate, fund and strengthen the evaluation of interventions in arenas such as the food, transport or welfare systems, often in the form of natural experiments.2

However, as the ongoing drip-feed of contested and contradictory research findings in respect of coronavirus pandemic control measures has illustrated, merely increasing the supply (and rigour) of primary studies is not enough.3 Governments have to make decisions all the time. The pivotal link between research and policy or practice should be the cumulation of insight from multiple studies in some form of evidence synthesis,4 but systematic reviews and guidance development groups frequently conclude that the available evidence about the effects of population health interventions is too diverse, flawed or inconclusive to support a more general conclusion about what should be done.2

One reason for this is that studies conducted in ‘real-world’ settings are often critiqued for a lack of internal validity in comparison with randomised trials in more controlled settings. This may be compensated for by greater external validity—the likelihood of producing practice-based evidence that might be successfully translated to the systems in which others work.5 However, the fact that these studies are produced in particular settings is also the main apparent impediment to their generalisability. Interventions to change such things as how products are taxed, how cities are laid out or how society supports people in old age inevitably take place in particular places with particular characteristics, which vary widely across the globe and even within countries. How might we do a better job of deriving meaningful generalisable inferences from studies like this to guide and support public health action in other places?

Promising solutions or false refuges

Feeding the meta-analytical machine: piling the stack and singing in harmony

Thrombolysis was not routinely used to treat heart attack until the late 1980s. If the available trials had been combined in a meta-analysis, however, its effectiveness would have been established beyond reasonable doubt by 1973.6 Precedents like this suggest that the solution to a lack of evidence is simply to conduct more intervention studies in more places, on the basis that once we have a tall-enough stack of good-enough papers to populate a meta-analysis, we will know.
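The logic of the 'tall-enough stack' can be made concrete with a minimal sketch of a cumulative meta-analysis, in which the pooled estimate is recalculated each time a study is added. The sketch assumes a simple fixed-effect, inverse-variance model on the log odds ratio scale. The function names and the trial figures are hypothetical and purely illustrative; they are not the thrombolysis data referred to above.

```python
import math


def pooled_estimate(effects, variances):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    return estimate, math.sqrt(1.0 / total)


def cumulative_meta_analysis(studies):
    """Re-pool the evidence after each study is added, in order.

    `studies` is a list of (order, log_odds_ratio, variance) tuples.
    Returns a list of (order, pooled_log_or, ci_low, ci_high) tuples,
    using an approximate 95% confidence interval.
    """
    ordered = sorted(studies)
    results = []
    for i in range(1, len(ordered) + 1):
        effects = [s[1] for s in ordered[:i]]
        variances = [s[2] for s in ordered[:i]]
        est, se = pooled_estimate(effects, variances)
        results.append((ordered[i - 1][0], est, est - 1.96 * se, est + 1.96 * se))
    return results


def first_order_excluding_null(results):
    """Order of the first study at which the pooled interval excludes zero."""
    for order, _, low, high in results:
        if high < 0 or low > 0:
            return order
    return None


# Hypothetical trials: (order of publication, log odds ratio, variance).
HYPOTHETICAL_TRIALS = [
    (1, -0.40, 0.25),
    (2, -0.10, 0.16),
    (3, -0.35, 0.09),
    (4, -0.30, 0.03),
    (5, -0.25, 0.02),
]

if __name__ == "__main__":
    for row in cumulative_meta_analysis(HYPOTHETICAL_TRIALS):
        print(row)
```

With these made-up numbers the pooled interval first excludes zero at the fourth study, although no single early study would have been conclusive on its own. The sketch works only because every study reports the same kind of effect on the same scale.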

In practice, however, many systematic reviews of population health improvement strategies have been more successful in delineating what we do not know than in identifying unequivocally effective interventions.2 Often, what has prevented the formulation of clear answers is not so much a lack of studies as the lack of a way of reconciling the diversity of their study designs, limitations, interventions and contexts.7 One apparent solution is to limit meta-analysis to a set of more statistically comparable studies, but this risks perpetuating an evaluative bias in favour of an intervention ‘monoculture’ that may or may not include the most promising strategies.8 Another way of dodging the challenge is to split the problem into more and more discrete and evaluable chunks. These may eventually tell us the effect of doing X, but however refined the answers to this kind of ‘splitting’ question turn out to be, they are not sufficient to address the more pressing ‘lumping’ question for public health: how can we best achieve Y?9 10

If population-level intervention studies were to use a common set of exposure and outcome measures, this would make meta-analysis more feasible. Important progress has been made in this respect, for example in physical activity epidemiology.11 One might envisage some form of multicentre study in which more-or-less comparable interventions were introduced (or not) in different places and evaluated along the lines of a cluster randomised controlled trial. However, the Achilles heel of this vision is the qualifier ‘more-or-less comparable’. Some interventions, such as screening programmes, might be designed and implemented in a sufficiently similar way for this kind of multicentre evaluation.12 For upstream interventions in complex systems, however, the harmonised measurement of exposures (to interventions) and intermediate and final outcomes involving multiple causal pathways is challenging. Consider, for example, the array of measures of pricing, product formulation, purchasing, consumption, diet, health, and potentially confounding background trends that are needed to properly quantify the intended and unintended impacts of introducing a national levy on sugar-sweetened drinks.13 Negotiating the harmonised implementation of truly comparable interventions in multiple jurisdictions and beyond the control of researchers may be even less feasible.

Broadening the scope: building the panopticon and modelling the solutions

If empirical intervention studies are so difficult to design, implement or combine in meta-analysis, why not make more use of observational and simulation methods? The growth of ‘big data’ and interest in the ‘quantified self’ now offer unprecedented possibilities to gather enormous quantities of information, whether from surveillance systems such as traffic cameras or from portable devices unobtrusively capturing continuous geographical, physiological and other signals from individuals. This torrent of data has led us towards a contemporary version of Bentham’s panopticon, telling observers exactly what people are or have been doing, where, when and even with whom. Datasets of this breadth, depth and precision make it possible to investigate associations with an unprecedented degree of statistical power and analytical complexity, and one might assume that so long as sufficiently rich data are available to populate such analyses, more and more secure causal inference will follow. The results can also be used as inputs to tools such as system dynamics modelling to simulate the consequences of altering upstream determinants of health, identify new intervention points and explore to what extent the outcomes observed in one system may be generalisable to others.14

However, enthusiasm for increasing computational complexity in the search for causal inference from observational or simulation data should be tempered with the recognition that design-based inference is generally considered stronger than model-based inference.15 In other words, we should attend at least as much to investigating situations in which different groups are exposed to different exogenous factors (interventions, or at least determinants of change) as we do to refining ways of eliciting ‘causal’ evidence from other datasets. Even if well-founded concerns about the representativeness and privacy implications of relying on ‘big data’ can be addressed, the resulting associational cornucopia is unlikely to help much if it contributes merely to producing ‘ever more sophisticated answers to the wrong questions’.16

For example, hundreds of studies now tell us that more walking is reported in areas where it is easier and safer for people to walk and there are places for them to walk to.17 However, precisely quantifying dose–response relationships of this kind does not necessarily explain how to address the problem of comparative inactivity, just as proving the aetiological case against tobacco did not explain how to reduce the prevalence of smoking.18 We should therefore not assume that the answers to the question of what we should do will be found by searching for statistical associations that only become noticeable in extremely large samples. Cohort and surveillance data collected for other purposes can certainly be used to investigate the effects of interventions,19 but no matter how intensively people’s health, behaviour and environments are quantified in observational studies, it may be a category error to assume that this will necessarily explain whether or how public health strategies actually work (or not). Epidemiology is only one of the tools in the box.20

Is the craft of evidence synthesis fit for purpose?

The point of evidence synthesis is surely to derive more generalisable causal inference. In spite of the academic language, this is as much an applied problem as an abstract, theoretical problem; to put it another way, what can transport planners in Birmingham learn from what their counterparts did in Bogotá?5

If cumulating evidence from multiple studies can be thought of as analogous to building a wall, then the ‘solutions’ outlined above can be regarded as ways of increasing the supply of bricks (the number of pieces of information), the similarity (statistical commensurability) of the bricks or the strength of the mortar (meta-analytical or other statistical methods for holding them together). These are helpful if the aim is to build a larger and stronger conventional wall, formed of neat rows of bricks of roughly the same shape and size.

Important and useful as all these approaches are, they have the potential to distract us from the real problem. Conventional brick walls work best on flat, smoothly prepared ground. Many contemporary public health challenges seem more akin to herding sheep in a mountainous landscape characterised by steep slopes, rocky outcrops and boggy ground. In this terrain, the more artisanal, bespoke and traditional solution of the dry stone wall may be more useful (figure 1). Dry stone walling is a way of transforming a pile of stones, which at first glance do not fit together, into something new and useful. Each stone is considered in its own right and assigned a unique place in the wall. No mortar is required, because careful thought is given to how all the pieces can be related to form a robust structure that is more than the sum of its parts. The art can be learnt, but it requires a level of flexibility and ingenuity that cannot readily be codified. It can therefore stand as a metaphor for the ‘holistic sense-making’ required of the evidence in population health intervention research.21 How might we better harness our research skills and technologies—ancient and modern—to build evidential structures more suited to the terrain we inhabit?

Building a dry stone wall of evidence

Looking beyond ‘interventions’

For most public health interventions—even well-established population screening programmes—the only honest answer to the question ‘Does it work?’ is ‘It depends’.14 22 In most cases, the questioner will need to clarify what they mean by work (in what terms?), what they mean by it (what, exactly, is the intervention?) and indeed what they mean by does (which implies a generalisable inference). Sometimes, what people really mean is ‘Will it work?’—a predictive question—or ‘Did it work?’—an empirical but also a particular question, perhaps better formulated as ‘What happened?’10

Why is this so difficult? Most public health interventions are at least somewhat unique to their context, which ought to be taken account of in their evaluation, and many can also be seen as interventions in complex systems.10 12 22 However, they do not necessarily take place in a context, at least not in the sense that a new clinical procedure might be introduced in a certain hospital or healthcare system. Rather than exerting an effect within a ‘moderating’ context, it may be more helpful to see these interventions as targeting and altering the context in which people live and make choices.10 23 24

But if everything is complex and context ‘is’ the intervention, what exactly might we seek to generalise from one instance to another? We tend to assume that interventions are things that work in and of themselves and might be universally generalisable, like Newton’s laws of motion.25 26 In practice, however, we find ourselves struggling to make sense of an apparently incommensurable body of evidence. By adopting a ‘naive and misplaced commitment to the reproducibility of the complex’, we may have unwittingly set ourselves an unrealistic challenge of identifying generalisable interventions as such.22 27 28 What if we were to release ‘our search for universal generalisability in favour of more modest, more contingent, claims’?22 Among other things, this would entail relaxing our grip on the notion of generalising the effects of interventions based on their forms or their ‘active components’, and turning our attention instead to their functions—the processes and changes they evoke—or indeed their ‘spirit’.12 26

To take a well-known clinical example, some reviews have found that the type of psychotherapy offered to a patient with depression makes little difference to the outcome.29 What to do, then? Others, taking a different approach to analysis, have found that the key lies in the quality of the therapeutic relationship established rather than the particular techniques used.30 Reasoning by analogy, we theorised that in another arena—promoting active travel—the myriad forms of intervention might be underlain by a more limited number of critical functions, such as increasing accessibility or safety, that are generalisable in principle but might be achieved in different ways in different situations.31 Again, this is as much to do with practical strategies as it is to do with theories, and some public health guidance and government policy already implicitly reflects this way of thinking with references to high-level principles, variable interpretations and the like (examples: table 1).

Table 1. Examples of generalisable principles reflected in policy and practice guidance

Searching for patterns

If this idea has traction, we will need to expand the scope and flexibility of our repertoire in evidence synthesis in order to derive more concrete and defensible inferences about what to do for public health. Rather than assessing whether interventions of a particular type ‘work’ in an overall sense, this will entail aggregating evidence for and against theories about intervention functions by combining information from studies conducted in different situations, including studies that were not explicitly designed with this in mind.7 14 27 32 33

How might we do this? We could accept the limitations of relying so heavily on testing binary statistical hypotheses about singular study outcomes,2 14 and turn our attention to seeking ‘inference to the best explanation’—that which provides the greatest understanding.18 34 35 We could use intervention theory to predict patterns that might be observed in a variety of data, and then assess the concordance between the observed patterns and the theoretical expectation patterns—testing theories rather than interventions.33 36 We could go further by systematically considering alternative potential explanations for the patterns we observe, doing our best to confirm or disconfirm these, and reaching a conclusion as to the most plausible causal inference from the overall pattern of findings. The approach is most easily illustrated within a single study, an evaluation of new transport infrastructure that was not designed to test any singular overarching hypothesis (case study: table 2).
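The idea of comparing observed patterns with those expected under competing theories can be illustrated with a deliberately crude sketch. It assumes that each candidate explanation can be reduced to a predicted direction of change for a handful of observable patterns, and it simply counts matches. The theory names, pattern names and values below are all hypothetical, and the scoring rule is an assumption made for illustration rather than a method described in this paper. In practice, weighing alternative explanations also draws on qualitative evidence and judgement that a tally of this kind cannot capture.

```python
def concordance(expected, observed):
    """Share of a theory's predicted patterns that match the observations.

    Both arguments map a pattern name to a direction of change:
    +1 (increase), -1 (decrease) or 0 (no change). Patterns for which
    nothing was observed are skipped. Returns None if nothing overlaps.
    """
    shared = [name for name in expected if name in observed]
    if not shared:
        return None
    matches = sum(1 for name in shared if expected[name] == observed[name])
    return matches / len(shared)


def rank_explanations(theories, observed):
    """Rank candidate explanations by concordance, highest first."""
    scored = [(name, concordance(exp, observed)) for name, exp in theories.items()]
    scored = [(name, score) for name, score in scored if score is not None]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical predictions for a new piece of transport infrastructure.
HYPOTHETICAL_THEORIES = {
    "improved accessibility": {
        "walking_near_route": +1,
        "walking_far_from_route": 0,
        "car_trips_near_route": -1,
        "reported_convenience": +1,
    },
    "background secular trend": {
        "walking_near_route": +1,
        "walking_far_from_route": +1,
        "car_trips_near_route": -1,
        "reported_convenience": 0,
    },
}

# Hypothetical observations drawn from several kinds of data.
HYPOTHETICAL_OBSERVED = {
    "walking_near_route": +1,
    "walking_far_from_route": 0,
    "car_trips_near_route": -1,
    "reported_convenience": +1,
}

if __name__ == "__main__":
    for name, score in rank_explanations(HYPOTHETICAL_THEORIES, HYPOTHETICAL_OBSERVED):
        print(f"{name}: {score:.2f}")
```

Here the made-up observations match every prediction of the first theory and only half of those of the second. The sketch shows only that the unit being tested is the theory, assessed against several kinds of data at once, rather than a single outcome of a single intervention.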

Table 2. Case study of the dry stone wall principle applied to an intervention study

One might counter that the principle of comparing observed and expected data applies equally to the paradigm of the randomised controlled trial. While this is true, testing theories in the way we describe entails a more radical challenge to established notions of a hierarchy of study design. For example, it is often understood that quantitative methods are for testing hypotheses, whereas qualitative (and some quantitative) methods play more subservient roles such as generating hypotheses, developing interventions or assessing their acceptability.32 One might also counter that approaches such as process evaluation, or realist evaluation and synthesis, already offer ways of investigating causal mechanisms.7 24 27 37 While this is true, the higher-order intervention functions and data patterns we are talking about are likely to reflect multiple underlying mechanisms25 and others have argued that more diverse lines of evidence should be converged and brought to bear on the challenge of overall causal inference. These might combine a variety of quantitative sources of causal estimation with a variety of quantitative and qualitative sources of causal explanation such as causal process observations.15 22 34 Simple examples from historical and contemporary communicable disease control illustrate this principle (table 3).

Table 3. Examples of arguments for convergent lines of evidence in communicable disease control

Embracing the mess

Some public health strategies will inevitably be more successful than others, and every ‘solution’ has the potential to generate more problems. Such uncertainty is—or at least should be—what drives scientific enquiry in the first place.38 Rather than hoping for ‘a neat, coherent story’ of clear-cut outcomes from evaluation, therefore, in most cases we should expect confusing, divergent, mixed or unexpected patterns of results.21 Far from denoting that an intervention or evaluation has failed, these shed light on what really happened, whether we like it or not.10

Although the metaphor of the dry stone wall can be applied within an individual study, as shown in table 2, the mess of this apparent dissonance may be better resolved by cumulating evidence from multiple studies over time within an intervention research programme or a systematic review. Our final case study illustrates how we applied the principles of linking diverse sources of evidence on causal estimation and causal explanation to identifying patterns and testing theories about intervention functions across a cumulated body of work on infrastructure to support active travel (table 4).31 It draws on a variety of research methods, resting on different philosophical assumptions, in pursuit of ‘clarification and insight, for which a more interpretive and discursive synthesis is needed’.3 9

Table 4. Case study of the dry stone wall principle applied to a systematic review


Many readers engaged with conducting, synthesising or applying the findings of population health intervention research are likely to agree with the editors of the Cochrane Handbook, who recently wrote that in spite of all the developments in quantitative methods for evidence synthesis, it is frequently still not possible for these ‘to provide insight beyond a commentary on what evidence has been identified’.7

We need to find a better way, otherwise merely piling up more studies may leave us confronting another kind of stack—endlessly circling the runway of a conclusion on which we never seem to have clearance to land. In this paper we have responded to a long-standing call for more ‘holistic sense-making’ in this arena. We have outlined a strategy of constructing ‘dry stone walls’ of evidence: pluralist mosaics whose strength derives from the complementarity of their components rather than being found in spite of it.39 This approach has the potential to be robust enough to support more plausible inference to guide action, while accepting and adapting to the reality of the public health landscape rather than wishing it were otherwise.

This will entail facing up to the challenge of working as ‘scholars, rather than just researchers’21—that is, artisanal dry stone wallers rather than bricklayers. We are advocating not a new method of evidence synthesis as such, but a more eclectic, flexible and reflexive approach; ‘not the abandonment of more reductive lines of research but the enlargement of these’40 with the more thoughtful and practical application of theory to generating practice-based evidence in public health. Ironically, it may only be by combining growing quantitative sophistication with the least technologically dependent research method of all—the anthropological tradition of the ethnographic observation of people and societies—that we will really understand what makes for an effective public health strategy.



  • Handling editor Seye Abimbola

  • Twitter @dbogilvie, @adrianbauman, @loudoestweet, @connyguell, @dkhumphreys, @jennapanter

  • Contributors DO conceived the original idea and drafted the manuscript. AB provided critical feedback during the initial drafting. LF, CG, DH and JP undertook the studies used as worked examples in tables 2 and 4 in collaboration with DO; and together with AB provided critical feedback during later drafting and contributed to the final version of the manuscript. DO is the guarantor.

  • Funding DO and JP are supported by the Medical Research Council (Unit Programme number MC_UU_12015/6). LF is funded by the National Institute for Health Research (NIHR) Global Health Research Group and Network on Diet and Activity, for which funding from NIHR is gratefully acknowledged (grant reference 16/137/34). CG is funded by the Academy of Medical Sciences and the Wellcome Trust (Springboard—Health of the Public 2040, grant reference HOP001/1051). The paper was initially developed in the course of a visiting appointment as Thought Leader in Residence at the School of Public Health at the University of Sydney, for which the intellectual environment and financial support provided by the Prevention Research Collaboration is gratefully acknowledged. It was further developed under the auspices of the Centre for Diet and Activity Research (CEDAR), a UKCRC Public Health Research Centre of Excellence at the University of Cambridge, for which funding from the British Heart Foundation, Economic and Social Research Council, Medical Research Council, National Institute for Health Research and the Wellcome Trust, under the auspices of the UK Clinical Research Collaboration, is gratefully acknowledged (grant reference MR/K023187/1).

  • Disclaimer The views expressed in this publication are those of the authors and not necessarily those of the National Health Service (NHS), the NIHR, the Department of Health and Social Care or any other funder.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement There are no data in this work.