A Bayesian Inference Framework to Reconstruct Transmission Trees Using Epidemiological and Genetic Data

Abstract
The accurate identification of the route of transmission taken by an infectious agent through a host population is critical to understanding its epidemiology and informing measures for its control. However, reconstruction of transmission routes during an epidemic is often an underdetermined problem: data about the location and timings of infections can be incomplete, inaccurate, and compatible with a large number of different transmission scenarios. For fast-evolving pathogens like RNA viruses, inference can be strengthened by using genetic data, nowadays easily and affordably generated. However, significant statistical challenges remain to be overcome in the full integration of these different data types if transmission trees are to be reliably estimated. We present here a framework leading to a bayesian inference scheme that combines genetic and epidemiological data, able to reconstruct most likely transmission patterns and infection dates. After testing our approach with simulated data, we apply the method to two UK epidemics of Foot-and-Mouth Disease Virus (FMDV): the 2007 outbreak, and a subset of the large 2001 epidemic. In the first case, we are able to confirm the role of a specific premise as the link between the two phases of the epidemics, while transmissions more densely clustered in space and time remain harder to resolve. When we consider data collected from the 2001 epidemic during a time of national emergency, our inference scheme robustly infers transmission chains, and uncovers the presence of undetected premises, thus providing a useful tool for epidemiological studies in real time. The generation of genetic data is becoming routine in epidemiological investigations, but the development of analytical tools maximizing the value of these data remains a priority. Our method, while applied here in the context of FMDV, is general and with slight modification can be used in any situation where both spatiotemporal and genetic data are available. In order to most effectively control the spread of an infectious disease, we need to better understand how pathogens spread within a host population, yet this is something we know remarkably little about. Cases close together in their locations and timing are often thought to be linked, but timings and locations alone are usually consistent with many different scenarios of who-infected-who. The genome of many pathogens evolves so quickly relative to the rate that they are transmitted, that even over single short epidemics we can identify which hosts contain pathogens that are most closely related to each other. This information is valuable because when combined with the spatial and timing data it should help us infer more reliably who-transmitted-to-who over the course of a disease outbreak. However, doing this so that these three different lines of evidence are appropriately weighted and interpreted remains a major statistical challenge. In our paper we present a new statistical method for combining these different types of data and estimating trees that show how infection was most likely transmitted between individuals in a host population. Because sequencing genetic material has become so affordable, we think methods like ours will become very important for future epidemiology.