Main

We used a deep sequencing approach to gain insight into the evolution of Ebola virus (EBOV) in Guinea during the ongoing West African outbreak. The approach was based on analysis pipelines developed for a guinea-pig model of EBOV infection and for Hendra virus infection of human and bat cells (refs 4, 5). Here we use it to derive consensus EBOV genomes from individual patient samples, which can be used to study viral genome evolution during the course of the outbreak. Viral genomes were derived primarily from blood samples that had been taken from patients in Guinea and sent to the European Mobile Laboratory (EMLab), deployed by the World Health Organization within the Médecins Sans Frontières Ebola Treatment Centre in Guéckédou in March 2014 to aid the diagnostic effort. With the permission of the Guinean authorities, a biobank of samples with known provenance of EBOV infection was assembled. The following data were linked to each sample: patient location (to district level), sample collection date, date of disease onset and outcome. The collection dates were a median of 4 days after the date of onset of symptoms. Baseline data were cleaned, formatted and imported into the Geographic Information System ESRI ArcGIS. Statistical tools were used to generate tabular output and to join the numeric case data with the district-level boundaries of Guinea, Liberia and Sierra Leone (district geometries freely available from http://www.gadm.org/) (Fig. 1a).

Figure 1: Geographical location, sequence read depth, and read depth vs Ct value of patient samples.

a, Geographical location of patient samples. The origins of the sequenced samples (one sample per patient) from Guinea, Sierra Leone and Liberia processed by EMLab Guéckédou are plotted as numbers of cases by district. EMLab data are overlaid on an Ebola outbreak distribution map where cumulative cases are plotted as a heat map (low (yellow) to high (brown)) of confirmed cases from March 2014 to January 2015. Case data sourced from World Health Organization (WHO) Ebola response situation reports (http://apps.who.int/ebola/en/ebola-situation-reports); Geographic Information Systems (GIS) data sourced from Environmental Systems Research Institute (ESRI) and Database of Global Administrative Areas (GADM; http://www.gadm.org/). b, Sequence depth per nucleotide position. The number of reads for each nucleotide position was plotted across the full length of the virus genome for each of the 179 virus isolates we analysed. Red shows the uniformity of the depth across individual genomes, although the median number of reads per nucleotide position varied over more than four log10 units. c, Linear regression of the log10 median sequence depth of each virus isolate against the Ct value, a measure of viral load determined by qRT–PCR. Red dots indicate samples taken from patients who went on to survive EBOV infection and grey shaded dots indicate samples from patients whose records suggest that they died from EBOV infection.

The viral genome sequence was derived from RNA sequencing analysis of the patient samples with no pre-amplification of the viral genome. In general we selected a range of samples from both males and females of different ages, with a fair representation of sequences for each month (Extended Data Fig. 1) and with Ct values of less than 20 for EBOV RNA. In this selected patient cohort, with a relatively high viral load, mortality was approximately 80%. The read depth mapping to the EBOV genome varied between samples and between regions of the genome (Fig. 1b), and in general the number of sequence reads obtained for each genome correlated with viral load as determined by quantitative reverse-transcription PCR (qRT–PCR) (Fig. 1c).
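
The relationship between read depth and Ct value described above can be illustrated with an ordinary least-squares fit of log10 median depth against Ct. The following is a minimal sketch only: the (Ct, depth) pairs are invented for illustration and are not taken from the study, and the helper name linear_regression is hypothetical.

```python
import math

def linear_regression(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical (Ct value, median read depth) pairs; NOT data from the study.
samples = [(12.0, 52000), (14.0, 9100), (16.0, 2400), (18.0, 610), (19.5, 95)]

ct_values = [ct for ct, _ in samples]
log_depths = [math.log10(depth) for _, depth in samples]

slope, intercept, r = linear_regression(ct_values, log_depths)
print(f"slope = {slope:.3f} log10 reads per Ct unit, r = {r:.3f}")
```

A negative slope is expected because a higher Ct value corresponds to less viral RNA in the sample, and therefore to fewer reads mapping to the EBOV genome.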

Phylogenetic analysis revealed the dynamic nature of the epidemic and molecular change in the viral sequence (Fig. 2a). Several distinct lineages were identified, with an initial lineage A (Figs 2a, 3 and Extended Data Fig. 2) linked to early Guinean cases dating from March 2014, including the three original viruses published by Baize et al. (ref. 2). A second lineage, B, emerged in May and June and comprises all the sequences from Gire et al. (ref. 6) and the remainder of those described here. As the epidemic expanded, lineage A remained confined to Guinea from March to June 2014, except for one sequence from 18 July 2014. A single Liberian sequence from March 2014 grouped within this lineage. None of the EBOV genomes that we sequenced from samples taken after July 2014 belonged to lineage A. This clade was likely to have been associated with the original outbreak in Guinea and was almost successfully contained in May 2014 by the interventions of the multi-agency response. Two clusters of Sierra Leone viruses described by Gire et al. (ref. 6) (denoted by the authors as clusters SL1 and SL2), both of which contain later viruses from Guinea and Liberia, suggest continued spread across the border during this time. Early cases in SL1 and SL2 were both associated with a single funeral (ref. 6), so it is possible that this event reignited the epidemic. Thereafter, lineage B spread into Guinea, Liberia and Sierra Leone. This lineage is associated with the large epidemics in these three countries and persisted into 2015. The spatiotemporal spread of these viruses, based on the phylogenetic analysis presented in Figs 2a and 3, was summarized (Extended Data Fig. 3) and indicated how the virus may have spread between the neighbouring countries. There was no evidence from the data that increases or decreases in mortality were associated with any particular virus cluster (Extended Data Fig. 4).

Figure 2: Phylogenetic relatedness and nucleotide sequence divergence of EBOV isolates from the 2013–2015 outbreak.

a, Phylogenetic relatedness of EBOV isolates. Phylogenetic tree inferred using MrBayes (ref. 11) for full-length EBOV genomes sequenced from 179 patient samples obtained between March 2014 and January 2015. Displayed is the majority consensus of 10,000 trees sampled from the posterior distribution with mean branch lengths. Posterior support is shown for selected key nodes. Twenty-two samples originated in Liberia and were collected between March and August 2014, and six samples from Sierra Leone were obtained in June and July 2014. In our analysis we also included published sequences, including the three early Guinean sequences (ref. 2) and 78 sequences described by Gire et al. (ref. 6). A number of lineages predominantly circulating in Guinea are denoted as GN1–4, along with a lineage unique to Sierra Leone (SL3) recognised in Gire et al. (ref. 6). b, EBOV nucleotide sequence divergence from the root of the phylogeny in Fig. 2a plotted against time of collection of each virus. The date of the first documented case near Meliandou in eastern Guinea is indicated by the red triangle.

Figure 3: A time-scaled phylogenetic tree of 262 EBOV genomes from Guinea, Sierra Leone, Liberia and Mali.

Shown is a maximum clade credibility tree constructed from 10,000 trees sampled from the posterior distribution with mean node ages. Clades described in Gire et al. (ref. 6) are identified here (SL1, SL2 and SL3), as well as a number of lineages predominantly circulating in Guinea; posterior probability support is given for these. For certain key node ages, 95% credible intervals are shown by horizontal bars.

The Bayesian time-scaled phylogenetic analysis estimated an average rate of evolution over the genome of 1.42 × 10⁻³ substitutions per site per year, with a 95% credible interval of 1.22 × 10⁻³ to 1.62 × 10⁻³. Details of the model assumptions are given in the Methods section. This rate is lower than that initially described for the West African outbreak by Gire et al. (ref. 6) but still higher than the long-term, between-outbreak rate of 0.8 × 10⁻³ substitutions per site per year estimated using viruses back to the 1976 Yambuku outbreak (ref. 6). This apparent drop in the rate of evolution between the two studies is consistent with the explanation provided by Gire et al. (ref. 6) that the short sampling interval (March to June) provided insufficient time for the action of purifying selection. However, the much longer sampling interval in the present study may simply be providing a more precise estimate of the rate. It should be noted that the between-outbreak rate will exclusively reflect transmission and evolution that has occurred in the non-human reservoir species, so it may not be directly comparable to the rate within a human outbreak. We observed no evidence of a change in evolutionary rate over the course of the epidemic, with the accumulation of genetic change having a linear relationship with time (Fig. 2b), confirming that the apparent decline in rate between the two studies is an observational phenomenon (ref. 7) rather than a change in the virus.
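
The linear relationship between divergence and time (Fig. 2b) can be illustrated with a simple root-to-tip regression, in which the slope approximates the substitution rate and the x-intercept approximates the date of the root. This is a sketch only and is not the Bayesian analysis used to obtain the estimate reported here. The sampling dates and divergence values are invented so that the slope falls near the reported rate, and the genome length of 18,959 nucleotides is an assumption used solely to express the rate per genome.

```python
def root_to_tip_rate(dates, divergences):
    """Least-squares regression of divergence on sampling date.

    Returns (rate, root_date): the slope in substitutions per site per
    year, and the date at which the fitted line reaches zero divergence.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    rate = sxy / sxx
    root_date = mean_t - mean_d / rate
    return rate, root_date

# Hypothetical sampling dates (decimal years) and root-to-tip divergences
# (substitutions per site); NOT data from the study.
dates = [2014.2, 2014.4, 2014.6, 2014.8, 2015.0]
divergences = [0.00030, 0.00056, 0.00085, 0.00115, 0.00141]

ASSUMED_GENOME_LENGTH = 18959  # assumption: approximate EBOV genome length in nucleotides

rate, root_date = root_to_tip_rate(dates, divergences)
per_genome = rate * ASSUMED_GENOME_LENGTH
print(f"rate = {rate:.2e} substitutions per site per year")
print(f"about {per_genome:.0f} substitutions per genome per year")
print(f"estimated root date = {root_date:.2f}")
```

Under these assumptions, a rate of this order corresponds to a few tens of substitutions per genome per year.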

The estimated date of the most recent common ancestor of the sampled viruses is mid-January 2014 (95% credible interval 12 December 2013 to 18 February 2014). Although this is an estimate of the first transmission event that resulted in more than one lineage in our sample, it provides an upper bound on the date of emergence of the virus into the human population. This date estimate is consistent with the epidemiological tracing of the first suspected cases to December 2013 (ref. 2).

Given the error-prone nature of EBOV genome replication, we examined the potential amino acid variation in EBOV proteins from the start of our sample collection in March 2014 to January 2015. The location of amino acid changes on EBOV proteins and their relative representation in the 179 assembled genomes were compared to an isolate identified in March 2014 (ref. 2) (Fig. 4). While there is amino acid variation in all of the genomes sampled, there were very few changes in viral protein 30 (VP30), viral protein 40 (VP40) and viral protein 24 (VP24), and these changes were present in fewer than 2% of the genomes sampled. However, a single amino acid substitution in VP24 is associated with adaptation to a new host (refs 4, 8), and this may be due to interactions with host-cell proteins (refs 9, 10). While some of the variation may be attributed to a purely random molecular clock pattern, in GP, VP35, NP and L there are some amino acid variations that are present in over 15% of the genomes sampled. For example, in GP there is an A to V substitution in 70.5% of the genomes sampled compared to the reference genome. The implications of the mutations within GP for immune escape from therapeutics and vaccines will need to be assessed in pseudotype neutralization assays using EBOV monoclonal antibodies and serum from people who have been vaccinated.
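
The frequencies of amino acid substitutions relative to a reference can be tabulated as in the following minimal sketch. It assumes that the protein sequences are already aligned to the reference and of equal length. The toy sequences and the function name substitution_frequencies are hypothetical and do not represent the study data.

```python
from collections import Counter

def substitution_frequencies(reference, sequences):
    """Percentage of sequences carrying each substitution.

    Labels follow the form <reference residue><1-based position><variant>.
    All sequences must be aligned to, and the same length as, the reference.
    """
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference")
        for pos, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
            if aa != ref_aa:
                counts[f"{ref_aa}{pos}{aa}"] += 1
    total = len(sequences)
    return {label: 100.0 * n / total for label, n in counts.items()}

# Toy aligned protein fragments; NOT sequences from the study.
reference = "MGAT"
sequences = ["MGVT", "MGVT", "MGAT", "MGVT"]

freqs = substitution_frequencies(reference, sequences)
for label, pct in sorted(freqs.items()):
    print(f"{label}: {pct:.1f}% of genomes")
```

In this toy example, three of the four sequences carry an A to V change at position 3, giving a frequency of 75%.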

Figure 4: Position of non-synonymous amino acid variations in the 179 genomes analysed in this study compared to a reference sequence taken from March 2014 (KJ660346.2).

Shown is the frequency of variation at every amino acid position that varied, together with the substitution that occurred; the first single-letter code indicates the residue in the reference sequence and the second indicates the variant. The percentage frequency in the 179 genomes is shown on the y axis. GP, glycoprotein; NP, nucleoprotein; L, RNA polymerase; VP, viral protein.

Methods

No statistical methods were used to predetermine sample size. There was no randomization or blinding in selection of samples for sequencing.

Ethics statement

The National Committee of Ethics in Medical Research of Guinea approved the use of diagnostic leftover samples and corresponding patient data for this study (permit no. 11/CNERS/14). As the samples had been collected as part of the public health response to contain the outbreak in Guinea, informed consent was not obtained from patients.

Genome sequencing and consensus building

Viral genome sequence was derived from the RNA extracted for diagnostic purposes from blood samples in the field with no pre-amplification of the viral genome. These samples were processed by the EMLab and are detailed in Supplementary Table 1, which indicates sample name, geographical location, date of onset of symptoms, date the sample was collected, and the Ct value of EBOV RNA at the date of test. The clinical status is also indicated, as well as malaria co-infection where known. Extracted RNA was DNase treated with Turbo DNase (Ambion) using the rigorous protocol. RNA sequencing libraries were prepared from the resultant RNA using the Epicentre ScriptSeq v2 RNA-Seq Library Preparation Kit. Following 10–15 cycles of amplification, libraries were purified using AMPure XP beads. Each library was quantified using Qubit and the size distribution assessed using the Agilent 2100 Bioanalyzer. These final libraries were pooled in equimolar amounts using the Qubit and Bioanalyzer data, with 9–10 libraries per pool. The quantity and quality of the pool were assessed by Bioanalyzer and subsequently by qPCR using the Illumina Library Quantification Kit from Kapa on a Roche LightCycler LC480II according to the manufacturer’s instructions. Each pool of libraries was sequenced on one lane of a HiSeq2500 at 2 × 125-bp paired-end sequencing with v4 chemistry.
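
Equimolar pooling from Qubit concentrations and Bioanalyzer fragment sizes can be sketched as follows. The exact calculation used is not stated here; this example assumes the common convention of 660 g/mol per base pair of double-stranded DNA. The library names, concentrations, fragment sizes and the target of 20 fmol per library are hypothetical.

```python
AVERAGE_BP_MASS = 660.0  # g/mol per base pair of dsDNA (assumed convention)

def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a mass concentration (ng/µl) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVERAGE_BP_MASS * mean_fragment_bp)

def pooling_volumes(libraries, fmol_per_library):
    """Volume (µl) of each library that contributes the same number of moles.

    libraries maps a name to (concentration in ng/µl, mean fragment size in bp).
    Because 1 nM equals 1 fmol/µl, volume = fmol / nM.
    """
    return {
        name: fmol_per_library / library_molarity_nm(conc, size)
        for name, (conc, size) in libraries.items()
    }

# Hypothetical libraries; NOT measurements from the study.
libraries = {
    "lib_A": (3.3, 500),   # 10 nM
    "lib_B": (6.6, 500),   # 20 nM
    "lib_C": (3.3, 250),   # 20 nM
}

volumes = pooling_volumes(libraries, fmol_per_library=20.0)
for name, vol in volumes.items():
    print(f"{name}: {vol:.2f} µl")
```

More concentrated libraries, or libraries with shorter fragments at the same mass concentration, contribute a smaller volume to the pool.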

The trimmed FASTQ files were first aligned to a copy of the human genome using Bowtie2 (ref. 12), and the unaligned reads were then mapped with Bowtie2 to a list of 3,731 known viral genomes excluding EBOV genomes. The reads that were still unmapped were then aligned to the EBOV genome, either the prototype strain isolated in Zaire in 1976 (AF086833.2) or a strain isolated during the current outbreak (KJ660348.2). For this step we again used Bowtie2, and the resultant alignment files were filtered with samtools to remove unmapped reads and reads with a mapping quality score below 11, followed by filtering with markdup to remove PCR duplicates. The resultant BAM file was then analysed by Quasirecomb (ref. 13) to generate a phred-weighted table of nucleotide frequencies, which was parsed with a custom Perl script to generate a consensus genome in FASTA format. This consensus genome was then used as a reference genome to which we remapped the sequence reads that did not map to the human genome or other viruses, in order to generate a second consensus. In this way we were able to manually determine whether the reference genome used by Bowtie2 influenced the process of calling a consensus genome. In addition, we used FreeBayes to independently call and identify SNPs and indels. The pipeline is entirely open source and implemented in the Galaxy environment (ref. 14); a Galaxy-compatible workflow, novel scripts and XML wrappers needed for implementation in Galaxy are freely available and included in Supplementary Data File 1. Sequence alignment maps were manually inspected and curated over regions with consistently low coverage (for example, at the 5′ ends).
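
Two steps of this pipeline, read filtering and consensus calling from a nucleotide frequency table, can be sketched as follows. This is an illustration written in Python and is not the custom Perl script provided in Supplementary Data File 1. The table layout, the tie-breaking rule, the min_depth parameter and all function names are assumptions; only the mapping quality threshold of 11 is taken from the text.

```python
def passes_filters(mapq, is_unmapped, min_mapq=11):
    """Keep mapped reads whose mapping quality is at least min_mapq."""
    return (not is_unmapped) and mapq >= min_mapq

def consensus_from_frequencies(freq_table, min_depth=1.0):
    """Call the most frequent base at each position.

    freq_table is a list with one dict per genome position, mapping a
    base (A, C, G or T) to its weighted frequency or count. Positions
    with total support below min_depth are reported as N. Ties are
    resolved in the fixed order A, C, G, T (an arbitrary choice).
    """
    consensus = []
    for counts in freq_table:
        depth = sum(counts.get(base, 0.0) for base in "ACGT")
        if depth < min_depth:
            consensus.append("N")
        else:
            consensus.append(max("ACGT", key=lambda base: counts.get(base, 0.0)))
    return "".join(consensus)

def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"

# Toy frequency table for four positions; NOT data from the study.
table = [{"A": 9.0, "G": 1.0}, {"C": 5.0}, {}, {"T": 2.0, "G": 7.0}]
consensus = consensus_from_frequencies(table)
print(to_fasta("toy_consensus", consensus), end="")
```

Remapping the reads against a consensus produced in this way, and calling the consensus a second time, allows any influence of the initial reference genome to be checked.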

Phylogenetic analysis

Phylogenetic analysis comprised the 179 EBOV genomes from this study, 78 genomes from Sierra Leone (ref. 6), three sequences from Guinea (ref. 2) and two sampled from Mali (ref. 15). The genomes were partitioned into four sets of sites (1st, 2nd and 3rd codon positions of the protein-coding regions, and the non-coding intergenic regions), with each partition being assigned a generalized time reversible substitution model (ref. 16), gamma distributed rate heterogeneity (ref. 17) and a relative rate of evolution. This model was used to construct a Bayesian nucleotide divergence tree (Fig. 2) using MrBayes (ref. 11) and a time-scaled phylogenetic analysis (Fig. 3) using BEAST (ref. 18) with a log-normal distributed relaxed molecular clock (ref. 19) and the ‘Skygrid’ non-parametric coalescent tree prior (ref. 20). The alignments and control files for both analyses are available in Supplementary Data Files 2 and 3 and provide documentation of all model parameters.
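
The partitioning of genome positions into codon positions and non-coding sites can be sketched as follows. This is a simplified illustration: it assumes non-overlapping coding regions on the forward strand given as 1-based inclusive coordinates, and the coordinates shown are invented. The partitions actually used are documented in the control files in Supplementary Data Files 2 and 3.

```python
def partition_sites(genome_length, cds_regions):
    """Assign every genome position to one of four partitions.

    cds_regions is a list of (start, end) pairs, 1-based and inclusive,
    each beginning at the first position of a codon.
    """
    codon_index = {}
    for start, end in cds_regions:
        for offset, pos in enumerate(range(start, end + 1)):
            codon_index[pos] = offset % 3  # 0, 1 or 2 within the codon

    partitions = {"codon1": [], "codon2": [], "codon3": [], "noncoding": []}
    for pos in range(1, genome_length + 1):
        if pos in codon_index:
            partitions[f"codon{codon_index[pos] + 1}"].append(pos)
        else:
            partitions["noncoding"].append(pos)
    return partitions

# Toy genome of 12 positions with one coding region; NOT EBOV coordinates.
partitions = partition_sites(12, [(3, 8)])
for name, sites in partitions.items():
    print(name, sites)
```

Each of the four resulting site sets would then receive its own substitution model parameters and relative rate in the phylogenetic software.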