Evaluations of water, sanitation and hygiene interventions should not use diarrhoea as (primary) outcome
  1. Samuel I Watson1,
  2. Ryan T T Rego2,
  3. Timothy Hofer3,
  4. Richard J Lilford1
  1. 1Institute of Applied Health Research, University of Birmingham, Birmingham, UK
  2. 2Center for Global Health Equity, University of Michigan, Ann Arbor, Michigan, USA
  3. 3Institute for Healthcare Policy and Innovation, University of Michigan, Ann Arbor, Michigan, USA
  1. Correspondence to Dr Samuel I Watson; s.i.watson{at}


Water, sanitation and hygiene interventions have been the subject of cluster trials of unprecedented size, scale and cost in recent years. However, the question ‘what works in water, sanitation, hygiene (WASH)?’ remains poorly understood. Evaluations of community interventions to prevent infectious disease typically use lab-confirmed infection as a primary outcome; however, WASH trials mostly use reported diarrhoea. While diarrhoea is a significant source of morbidity, it is subject to significant misclassification error with respect to enteric infection due to the existence of non-infectious diarrhoea and asymptomatic infection. We show how this may lead to bias of estimated effects of interventions from WASH trials towards no effect. The problem is further compounded by other biases in the measurement process. Alongside testing for infection of the gut, an examination of the causal assumptions underlying WASH interventions presents several other reliable alternative and complementary measurements and outcomes. Contemporary guidance on the evaluation of complex interventions requires researchers to take a broad view of the causal effects of an intervention across a system. Reported diarrhoea can fail to even be a reliable measure of changes to gastrointestinal health and so should not be used as a primary outcome if we are to progress our knowledge of what works in WASH.

  • epidemiology
  • cluster randomized trial
  • infections, diseases, disorders, injuries

Data availability statement

There are no data in this work.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Summary box

  • Despite several recent cluster trials of unprecedented size, scale and cost evaluating water, sanitation, hygiene (WASH) interventions, the question ‘what works in WASH?’ remains poorly understood.

  • Evaluations of community interventions to prevent infectious disease typically use lab-confirmed infection as a primary outcome; however, WASH trials mostly use reported diarrhoea.

  • Diarrhoea is a significant source of morbidity, but it is subject to significant misclassification error with respect to enteric infection due to the existence of non-infectious diarrhoea and asymptomatic infection.

  • We show how misclassification of diarrhoea leads to a bias of estimated effects of interventions from WASH trials towards no effect, which is compounded by further biases in the measurement process.

  • Reported diarrhoea can fail to be a reliable measure of changes to gastrointestinal health and so should not be used as a primary outcome if we are to progress our knowledge of what works in WASH.


Diarrhoeal disease remains one of the most prolific killers of children under 5.1 2 The predominant strategy to prevent these deaths is to improve water, sanitation and hygiene (WASH) infrastructure and related behaviours.3 Significant efforts are focused in low and middle-income countries (LMICs) where the burden of disease is highest. However, despite huge investment into intervention development and evaluation, the answer to the question ‘what works in WASH?’ remains poorly understood.4 5

In the last few years, three cluster trials of unprecedented size, scale and cost—WASH-Benefits Bangladesh,6 WASH-Benefits Kenya7 and the Sanitation Hygiene Infant Nutrition Efficacy (SHINE) trials8—found little evidence of benefit from any of the WASH components of the different interventions. Some commentaries have suggested that the interventions, which are among the most common types of WASH interventions in LMIC settings,3 were of inadequate intensity, poorly tailored to the local modes of disease transmission and possibly not acceptable to the targeted community.4 5 While we would agree with these conclusions in general, in this article, we will argue that there is an additional problem with these and many other trials of WASH interventions: the use of reported diarrhoea as the primary outcome measurement to assess effectiveness.

Diarrhoea as an outcome

Diarrhoea is typically defined as three or more loose or watery stools in a 24-hour period and it has both infectious and non-infectious causes.9 There are several different methods used to measure episodes of diarrhoea among the under-fives. By far, the most common is retrospective self-report in which caregivers or family members are asked to recount whether an infant has had diarrhoea in the preceding period (typically between 24 hours and 2 weeks).10 The Demographic and Health Surveys, UNICEF’s Multiple Indicator Cluster Surveys11 and the three large trials mentioned above use this community survey method. Other approaches include direct observation of collected stool samples by field workers,12 prospective diary-based methods10 or recording hospitalisation rates for diarrhoeal disease at local health centres and hospitals.13–15

WASH interventions are designed to interrupt faecal–oral transmission of pathogens in order to prevent enteric infection and subsequent diarrhoeal disease. So the typical objective of a WASH trial is to demonstrate that an intervention causes a reduction in diarrhoea rates. However, the only type of diarrhoea that can directly be reduced by WASH interventions is infectious diarrhoea. Our contention is that measuring all episodes of diarrhoea instead of infectious diarrhoea or enteric infection may result in significant measurement error, which limits its usefulness for assessing WASH interventions.

The problem with diarrhoea

Diarrhoea is the primary clinical presentation of symptomatic diarrhoeal disease and a direct cause of significant morbidity through dehydration and malnutrition. So it might seem like an obvious outcome to assess changes in diarrhoeal disease rates. However, this same logic of measuring an outcome with multiple causes does not apply to trials in other disease areas. For example, randomised trials of interventions to reduce transmission of COVID-19 do not use cough or febrile illness as an outcome, nor do interventions targeted at reducing human papillomavirus examine vaginal bleeding as an outcome; they use laboratory confirmed outcomes, including PCR-based methods16 17. To see why, we can think of symptom reporting as a form of diagnostic test.

Consider diarrhoea and enteric infection by pathogenic organisms: some people are carriers so not all cases of enteric infection present with diarrhoea (the ‘sensitivity’ is below 100%), and some people have diarrhoea in the absence of enteric infection (the ‘specificity’ is below 100%). Table 1 shows an example 2×2 contingency table. If we use the prevalence (or rate) of diarrhoea to try to estimate the prevalence (or rate) of enteric infection, our estimate will be biased. If $p$ is the true prevalence of infection and $\hat{p}$ is our estimate of the prevalence based on diarrhoea, then:

$$\hat{p} = p\,Se + (1 - p)(1 - Sp)$$

where $Se$ is the sensitivity and $Sp$ the specificity and both are between 0 and 1. Any measure based on this biased estimator, such as rate or risk ratios, will itself be biased. For example, if $p_0$ and $p_1$ are the true prevalences in the control and treatment arms of a trial, respectively, then:

$$\frac{\hat{p}_1}{\hat{p}_0} = \frac{p_1\,Se + (1 - p_1)(1 - Sp)}{p_0\,Se + (1 - p_0)(1 - Sp)}$$

which is not equal to $p_1 / p_0$, the true relative risk, unless the specificity is 100%. Indeed, the worse the sensitivity and specificity are, the more the ratio will be biased towards 1, that is, no effect.

Table 1

A two-by-two contingency table for enteric infection and diarrhoea

The problem described here is often referred to as classification error, which is the more specific term for measurement error that occurs when using dichotomous or categorical measurements.18 Neuhaus demonstrated that, in logistic regression, misclassification of the response variable causes both a loss of efficiency and biased, attenuated effect estimates.19 This bias will be present when estimating any benefit of WASH interventions whenever the outcome, diarrhoea due to infection, is misclassified.
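The attenuation described in this section can be checked numerically. The following is a minimal sketch in Python, not code from the article: the function names are hypothetical, and the inputs (10% infection prevalence in the control arm, a true relative risk of 0.5) are illustrative assumptions.

```python
def observed_prevalence(p, sensitivity, specificity):
    """Expected prevalence of diarrhoea when the true infection prevalence is p.

    Sum of true positives (infected and symptomatic) and false positives
    (uninfected but with diarrhoea).
    """
    return p * sensitivity + (1.0 - p) * (1.0 - specificity)


def observed_relative_risk(p_control, p_treatment, sensitivity, specificity):
    """Relative risk a trial would be expected to estimate from the misclassified outcome."""
    return (observed_prevalence(p_treatment, sensitivity, specificity)
            / observed_prevalence(p_control, sensitivity, specificity))


# Hypothetical inputs: 10% infection prevalence in control, true relative risk 0.5.
p_control, p_treatment = 0.10, 0.05

# Perfect specificity: sensitivity cancels and the true ratio is recovered.
rr_perfect_spec = observed_relative_risk(p_control, p_treatment, 0.6, 1.0)

# Imperfect specificity: the ratio moves towards 1 (no effect).
rr_imperfect = observed_relative_risk(p_control, p_treatment, 0.9, 0.9)

print(round(rr_perfect_spec, 3), round(rr_imperfect, 3))
```

Under these assumed inputs the first ratio stays at the true value of 0.5 even though sensitivity is only 60%, while the second is pulled to roughly 0.78, which illustrates why specificity matters most for a ratio measure.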

Evidence on the ‘diagnostic performance’ of diarrhoea

The problem goes away if the sensitivity and specificity of diarrhoea are close to 100%. Unfortunately, these values are unknown as they likely vary between countries and contexts as well as between methods of ascertaining diarrhoea rates. The lack of certainty alone should generate caution when using these outcomes. However, recent evidence suggests that they are very unlikely to be close to 100%. For example, the two-by-two contingency table above can also be used to derive an OR (AD/BC in table 1): if sensitivity and specificity were 100%, we would expect the ORs to tend to infinity, that is, be very large as B and C would be 0. Relatively small ORs, therefore, indicate poor ‘diagnostic performance’. A recent systematic review of case-control studies comparing enteropathogen presence in mostly hospitalised cases of diarrhoea versus controls without diarrhoea found ORs for different pathogens to predominantly fall in the range of 0.5–5.0,20 providing evidence of misclassification bias that differs by pathogen. Only very aggressive pathogens like Vibrio cholerae had very large ORs (around 50). In a recent study we conducted in the Cox’s Bazar camps for Forcibly Displaced Rohingya Population from Myanmar in Bangladesh, we estimated the all-pathogen sensitivity and specificity of carer-reported diarrhoea to be 0.49 (95% CI 0.39 to 0.66) and 0.65 (95% CI 0.41 to 0.85), respectively.12 The incidence of diarrhoea was unusually high in this setting and if sensitivity and specificity vary with prevalence, then these results might underestimate sensitivity and specificity in other settings.
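The link between the OR and ‘diagnostic performance’ can be made explicit: for a symptom treated as a diagnostic test, the OR of the two-by-two table equals [sensitivity/(1 − sensitivity)] × [specificity/(1 − specificity)], whatever the prevalence. The sketch below, in Python, applies this identity to the point estimates quoted above. The function name is hypothetical, the comparison values of 0.95 are assumptions, and the calculation ignores the uncertainty in the estimates, so it is an illustration rather than a reanalysis.

```python
import math


def diagnostic_odds_ratio(sensitivity, specificity):
    """Odds ratio of the 2x2 table implied by a given sensitivity and specificity.

    Both inputs must lie in (0, 1]. Returns infinity when either equals 1,
    because one of the discordant cells is then empty.
    """
    if not (0.0 < sensitivity <= 1.0 and 0.0 < specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in (0, 1]")
    if sensitivity == 1.0 or specificity == 1.0:
        return math.inf
    return (sensitivity / (1.0 - sensitivity)) * (specificity / (1.0 - specificity))


# Point estimates for carer-reported diarrhoea quoted in the text.
dor_reported = diagnostic_odds_ratio(0.49, 0.65)

# A hypothetical, much better measure for comparison (assumed values).
dor_good = diagnostic_odds_ratio(0.95, 0.95)

print(round(dor_reported, 2), round(dor_good, 1))
```

With these inputs the implied OR for carer-reported diarrhoea is a little under 2, which sits inside the 0.5–5.0 range reported by the review, whereas a measure with 95% sensitivity and specificity would imply an OR in the hundreds.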

Figure 1

Relationship between baseline prevalence, sensitivity and specificity and the estimated relative risk. In all cases, the true relative risk is 0.5.

The diarrhoea prevalences reported in the three large trials6–8 were 5% to 10%, which puts a requisite lower limit on the specificity of 90% to 95%. However, even with these more optimistic figures, we might still suspect quite significant bias. Figure 1 shows a hypothetical example in which the true relative risk of enteric infection between a treatment (eg, water and sanitation improvement) and a control group is 0.5 and the baseline prevalence of infection is either 10% or 25%. We show how different values of sensitivity and specificity of a diarrhoea outcome affect the estimated relative risk of the study. In the case where the baseline is 10%, even if sensitivity and specificity are as high as 90%, the estimated relative risk is attenuated from 0.5 to 0.78. The same effect would be apparent for relative risks greater than 1 as well.
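As a worked check of the figures in this paragraph, write $Se$ and $Sp$ for sensitivity and specificity (notation assumed here for illustration) and take the expected diarrhoea prevalence in each arm to be the true-positive fraction plus the false-positive fraction. With a 10% baseline prevalence of infection, a true relative risk of 0.5 and $Se = Sp = 0.9$:

```latex
\begin{aligned}
\hat{p}_0 &= 0.10 \times 0.9 + 0.90 \times 0.1 = 0.18 \\
\hat{p}_1 &= 0.05 \times 0.9 + 0.95 \times 0.1 = 0.14 \\
\hat{p}_1 / \hat{p}_0 &= 0.14 / 0.18 \approx 0.78
\end{aligned}
```

The lower limit on specificity follows from the same expression. False positives alone cannot exceed the observed prevalence, so $(1 - p)(1 - Sp) \le \hat{p}$; when the true infection prevalence $p$ is small this gives approximately $Sp \ge 1 - \hat{p}$, that is, about 90% to 95% for observed prevalences of 10% to 5%.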

The problems above are further compounded when using self-reported and survey-based measures of diarrhoea (due to additional measurement error introduced by the difference between ‘diarrhoea’ and ‘reported diarrhoea’ in figure 2). Our aforementioned study compared agreement statistics (Cohen’s κ) for, among other methods, a standard retrospective recall survey, a survey augmented with pictorial representations of different stools and visual inspection by trained field workers; we estimated values of between −0.1 and 0.1, indicating very poor agreement.12 Changes in the length of recall period or the frequency of questioning can also affect estimated rates of diarrhoea.21–23 Therefore, it is very likely that the above three WASH trials, and other comparable studies, have underestimated any intervention effect. Alternatives to survey-based diarrhoea assessment also have their own issues. For example, hospital-reported rates are low, occurring about once for every 50 cases of carer-reported diarrhoea, suggesting that this measurement suffers from severe underascertainment of community cases and/or under-reporting by hospitals.10 The biases we describe exist in addition to others that may affect trials in this area, such as selective attrition or selection bias.
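Agreement between two binary measurement methods is conventionally summarised with Cohen’s kappa, which compares the observed agreement with the agreement expected by chance. The following Python sketch shows the calculation. The function name and the ten paired ratings are hypothetical and are not data from the study cited above.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of binary (0/1) ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of cases on which the two methods agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each method's marginal positive rate.
    rate_a = sum(ratings_a) / n
    rate_b = sum(ratings_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical data: 1 = diarrhoea recorded, 0 = not, for the same 10 children
# assessed by a recall survey and by field-worker inspection.
survey = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
inspection = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]

kappa = cohens_kappa(survey, inspection)
print(round(kappa, 2))
```

In this made-up example the two methods agree on half of the children, which is slightly less than the 54% expected by chance, so kappa is just below zero. Values near zero of this kind are what ‘very poor agreement’ means in practice.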

Figure 2

Simplified, illustrative causal diagram linking diarrhoeal disease intervention to outcomes with examples of such interventions.

Alternatives to diarrhoea as a primary outcome

The aim of a WASH intervention is to prevent symptomatic disease and morbidity.24 25 One might argue that rates of infection with pathogenic bacteria are, therefore, only of instrumental importance, while symptomatic illness is of primary clinical importance and as such the presence of infectious diarrhoea should be the ‘primary outcome’. However, this argument fails on two fronts. First, with few exceptions, studies that use reported diarrhoea rates only measure whether the symptom is present and do not confirm the underlying infection status, leading to the misclassification problem described above. Second, newer models of gut health and the microbiome suggest that the presence of enteric pathogens reflects a significant loss of colonisation resistance, which may have clinical significance due to the immunologic and microbiome changes that led to the loss of resistance and which themselves can increase the pathogenic potential of any non-commensal gut resident.26 27 Asymptomatic infected people may also be an important reservoir for pathogens that cause symptomatic infections in other people in areas where water and sanitation are inadequate.28 29 WASH interventions, by reducing transmission, may well, therefore, significantly reduce the amount of asymptomatic infection. Thus, preventing infection, whether symptomatic or not, has further relevance to lowering morbidity and mortality, so it cannot be claimed to be only of instrumental importance. We should, therefore, consider alternatives.

Process outcomes

Figure 2 shows a simplified and illustrative causal model for the effects of an intervention designed to tackle diarrhoeal disease with some examples of WASH interventions. The first set of outcomes are the immediate, ‘upstream’ effects, such as changing behaviour. In the language of complex interventions, these are often called ‘process outcomes’, but they could also be referred to as upstream ‘mediating variables’ in line with the burgeoning statistical and epidemiological literature on causal modelling.30 Some cluster trials of specifically behavioural WASH interventions have used these as primary outcomes.31 32 The three large WASH trials captured behavioural outcomes generally as measures of adherence to the intervention, and while they show improvement over time in an intervention cluster, they stop short of formally comparing them between intervention and control. The largest effects of the intervention are likely to be seen on these process outcomes33 and they are relatively inexpensive to collect; however, some assessments of behaviour may be subjective and, thus, subject to similar biases as diarrhoea. Influencing the process outcomes is also only a necessary, but not sufficient, condition for an effect to materialise on the more downstream outcomes.

Short-term epidemiological and clinical outcomes

We then have the short-term epidemiological outcomes, such as enteric infection and diarrhoea. Direct assessment of enteric infection is much less common in WASH trials than diarrhoea; however, there are some notable examples.34 35 A secondary analysis of the SHINE trial data published in a separate article examined enteric infection captured from stool samples in the trial. They found evidence of reduced prevalence of parasites, but little evidence of change in viral and bacterial carriage rates.36 Enteric infection presents an attractive option as it is ‘objective’ in the sense of being lab based rather than survey based.

One potential barrier to the use of microbiological outcomes, such as the presence of gut pathogens, is their cost and resource requirements. Stool testing requires the storage and refrigerated shipping of large numbers of samples to a lab equipped with trained staff and expensive equipment. Indeed, in many LMIC settings, such lab facilities are not available at the required scale. Researchers could limit their inferences to the more upstream outcomes in these circumstances or reduce the sample size or number of pathogens to test to reduce costs. Alternatives include sample pooling and environmental surveillance.37 Another alternative may be rapid field tests, including immunochromatographic assays, to establish infection. While these tests also have imperfect sensitivity and specificity, their diagnostic performance can be established in a lab and used to adjust or correct results at the end using the misclassification model described above.18 38 Indeed, we are conducting a pilot study of such a data collection process.
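When the sensitivity and specificity of a test are known, the relationship between observed and true prevalence can be inverted to correct the observed proportion; this inversion is commonly known as the Rogan-Gladen estimator. The sketch below, in Python, is a minimal illustration and not the authors' method. The function name, the test characteristics and the observed proportions are all hypothetical, and the point correction does not propagate uncertainty in the sensitivity and specificity estimates.

```python
def corrected_prevalence(observed, sensitivity, specificity):
    """Invert observed = p*Se + (1 - p)*(1 - Sp) to recover the true prevalence p.

    Requires Se + Sp > 1 (the test must be better than chance). The result is
    clipped to [0, 1] because sampling noise can push the raw value outside it.
    """
    youden = sensitivity + specificity - 1.0
    if youden <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    raw = (observed + specificity - 1.0) / youden
    return min(1.0, max(0.0, raw))


# Hypothetical rapid-test characteristics established in a lab.
SE, SP = 0.85, 0.95

# Hypothetical test-positive proportions in the control and treatment arms.
p_control = corrected_prevalence(0.13, SE, SP)
p_treatment = corrected_prevalence(0.09, SE, SP)

print(round(p_control, 3), round(p_treatment, 3), round(p_treatment / p_control, 3))
```

In this constructed example the raw ratio of test-positive proportions is about 0.69, while the corrected prevalences of 10% and 5% recover a relative risk of 0.5. In a real analysis the correction would more usually be built into the statistical model so that uncertainty is carried through to the effect estimate.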

Long-term developmental outcomes

Finally, there are the long-term developmental outcomes around physiological and cognitive development of which linear growth is frequently reported.5 39 Long-term health and well-being outcomes may be preferred as they are intrinsically valuable, whereas the other outcomes may be considered only of instrumental value. While obviously important, these long-term outcomes result from the confluence of a range of factors. Any effect here is likely to be small and hidden among significant noise. The shorter term epidemiological outcomes might, therefore, represent a good trade-off.

There are evidently many potential outcomes a trial could use, which are often captured. However, most WASH trials use only a single ‘primary’ outcome on which the main conclusion of the trial is based, which may be a consequence more of the requirements of null hypothesis significance testing,40 rather than a principled approach to scientific investigation of WASH.

An important corollary to this discussion is that it is difficult to ascertain the effectiveness of a complex intervention, or an intervention in a complex causal path, by looking only at one outcome, especially one with significant measurement error. For example, consider a behavioural change WASH intervention that aims to educate caregivers about improving hygiene and reducing contamination of food and water. If we choose diarrhoea as a single primary outcome and find little evidence of an effect of the intervention, there is little we can infer about the intervention’s effectiveness as small relative effects with diarrhoea as an outcome can be compatible with larger reductions in enteric infection. Further to that, the intervention may have been very successful as a behaviour change intervention. For example, caregivers might have adopted handwashing and water chlorination. But unbeknownst to the researchers the primary transmission pathway for enteric pathogens was geophagy or another alternative. The design of the methods of education was not at fault; it was the subject of the training that was poorly aligned with the context. The lack of contextualising information and observations from the causal chain between intervention and clinical outcome means there is little opportunity to triangulate evidence and interpret findings.

An update to the influential framework for designing and evaluating complex interventions by the UK’s Medical Research Council was recently published.41 They identify four different but overlapping research perspectives and questions for complex interventions: efficacy, effectiveness, theory-based and systems’ perspective. The latter three are of most relevance here, which we can summarise as: does the intervention produce the intended effects in real-world settings? What works in which circumstances and how? And, how do the system and intervention adapt to one another? We would argue that the ‘intended effects’ are often at several points in a causal pathway, such as changing behaviour, to reduce water and food contamination and, hence, the transmission of enteric pathogens and symptomatic illness (figure 2). It is, therefore, only by looking at these different outcomes that we can answer the effectiveness question, and in so doing start to answer the theory-based and systems questions. As the guidance describes, no trial provides a simple yes/no answer to the question ‘did it work?’, especially when the trial is examining interventions in complex systems.41

Trials of community interventions to tackle other infectious diseases can also provide useful exemplars for the WASH community. For example, a recent trial of mask wearing in 600 clusters incorporating over 340 000 people in Bangladesh used symptom reporting alongside seroprevalence studies and adherence measures, particularly mask wearing, as outcome measures.42 The researchers could both demonstrate an increase in mask wearing and a subsequent decrease in seroprevalence.

Most methods to correct for misclassification bias in an outcome require independent knowledge of the sensitivity and specificity of the measurement method. The misclassification errors associated with diarrhoea as an outcome are variable and difficult to estimate as we have described. While there are now methods that will handle misclassification error to produce unbiased estimates of treatment effects, it is always better to improve the measurement.43 Evidently, there may be a bias-variance trade-off to make: between a small, relatively uncertain but unbiased trial, or a large, ‘certain’ but biased one. However, we believe that the stronger consequence of our argument is that reliance on a single outcome, particularly if it is diarrhoea, should be abandoned in favour of approaches that respect the complex nature of the intervention and system, and that allow for triangulation of the evidence across the causal pathway.
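The bias-variance trade-off can be illustrated by comparing root mean squared error on the log relative risk scale. The Python sketch below is a rough illustration under strong assumptions, all of them hypothetical: individual rather than cluster randomisation (clustering would inflate both variances), equal arm sizes, a delta-method standard error, sensitivity and specificity of 0.9 for diarrhoea, and arbitrary sample sizes of 5000 and 1000 per arm.

```python
import math


def log_rr_se(p_control, p_treatment, n_per_arm):
    """Delta-method standard error of the log relative risk for two equal arms."""
    return math.sqrt((1 - p_treatment) / (n_per_arm * p_treatment)
                     + (1 - p_control) / (n_per_arm * p_control))


def rmse(bias, se):
    """Root mean squared error combining bias and sampling variability."""
    return math.sqrt(bias ** 2 + se ** 2)


true_p0, true_p1 = 0.10, 0.05   # assumed true infection prevalences (RR = 0.5)
obs_p0, obs_p1 = 0.18, 0.14     # expected diarrhoea prevalences at Se = Sp = 0.9

# Large trial with the misclassified outcome: precise but biased.
bias_large = math.log(obs_p1 / obs_p0) - math.log(true_p1 / true_p0)
rmse_large = rmse(bias_large, log_rr_se(obs_p0, obs_p1, n_per_arm=5000))

# Smaller trial with a lab-confirmed outcome: unbiased but noisier.
rmse_small = rmse(0.0, log_rr_se(true_p0, true_p1, n_per_arm=1000))

print(round(rmse_large, 3), round(rmse_small, 3))
```

Under these particular assumptions the smaller, unbiased trial has the lower error, because the bias from misclassification does not shrink as the sample grows. Different assumed values could reverse the comparison, so the sketch shows how the trade-off might be examined rather than which design is better in general.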


One explanation for the slow progress on reducing the risk of diarrhoeal disease in many LMICs may be that a solution, involving large-scale water and sewerage infrastructure, is unobtainable in many low-resource settings without significant external investment. In recent years, there have been many innovative technological solutions proposed for aspects of WASH like faecal sludge management, and access to clean drinking water and food preparation. Altogether, a package of such measures might provide significant relief in some settings that lack large-scale public health infrastructure. To identify what to include in such a successful programme, the question ‘what works in WASH?’ needs to be better answered using the best possible methods and measurements. We have argued that future trials in this area should not use survey-based diarrhoea as the primary outcome to avoid bias and inappropriate conclusions about the effects of an intervention. For a WASH intervention to be successful, it must cause a ‘domino effect’ across multiple mediating outcomes, such as behaviour change and interruption of pathogen transmission. Failure to reduce diarrhoea may result from a breakdown at any one of these steps. Even if a trial were to demonstrate that an intervention causes a reduction in adverse clinical outcomes, even unbiased ones, the nature of the complex system means we may further struggle to generalise these findings to other settings. Trial outcomes should be chosen from the causal pathway to better understand how an intervention functions or fails to do so, and in what context.

Ethics statements

Patient consent for publication



  • Handling editor Seye Abimbola

  • Twitter @siwatson

  • Contributors SIW and RL conceived the idea for the manuscript. SIW prepared the first draft. RL, TH and RTTR contributed to rewrites and edits.

  • Funding Medical Research Council (MR/V038591) and National Institute for Health Research (ARC West Midlands, EP/V028936).

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.