Genome sequencing technologies continue to develop at a remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users, and their scope is often unclear, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling that is accessible to researchers without specialist computing expertise and applicable across diverse experimental circumstances. Notable technical challenges have impeded progress: fragments of viral genomes are typically orders of magnitude less abundant than those of the host, bacteria, and other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes, demanding the use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the small number of documented viral reference genomes relative to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.
metagenomics, assembly, next-generation sequencing, classification, surveillance, epidemic
In the last decade, at least seven separate viral outbreaks have caused tens of thousands of human deaths (Woolhouse, Rambaut, and Kellam 2015), and the ever-increasing density of livestock, rate of habitat destruction, and extent of global human travel provide a fertile environment for new pandemics to emerge from host-switching events (Delwart 2007; Fancello, Raoult, and Desnues 2012), as was the case for SARS, Ebola, Middle East Respiratory Syndrome (MERS), and influenza A (H1N1) (Castillo-Chavez et al. 2015). At present we have a limited grasp of the extent of viral diversity present in the environment: the 2014 database release of the International Committee on Taxonomy of Viruses classified just 7 orders, 104 families, 505 genera, and 3,286 species (http://www.ictvonline.org/virustaxonomy.asp); yet one study estimated that there are at least 320,000 virus species infecting mammals alone (Anthony et al. 2013).
High-throughput (or so-called ‘next-generation’) sequencing of viruses during the most recent outbreaks of Ebola in West Africa (Gire et al. 2014; Carroll et al. 2015; Park et al. 2015; Quick et al. 2016) and MERS in Saudi Arabia (Cotten et al. 2013) has facilitated rapid identification of transmission chains and rates of viral evolution, and provided evidence of the zoonotic origin of these outbreaks. Access to such information during the initial stages of an outbreak would offer invaluable insight into when, where, and how an epidemic might emerge, informing intervention and mitigation measures or even preventing an epidemic altogether. A major step towards this goal is therefore to identify existing zoonotic and environmental pathogens with pandemic potential. This is a significant undertaking, demanding considerable investment and close collaboration between governments, NGOs, and academia (for example, the USAID PREDICT program, http://www.vetmed.ucdavis.edu/ohi/predict/index.cfm), as well as on-the-ground surveillance by local authorities and scientists in the areas of the world most at risk.
The characterization of unknown viral entities in the environment is now possible with modern sequencing; however, current tooling for exploiting these data represents a practical and methodological bottleneck for effective data analysis. Practically, most available software tools are inaccessible to the majority of potential users, demanding expertise and computing resources often lacking among the researchers from diverse backgrounds involved in sample collection, sequencing, and analysis. There is a need for robust and intuitive analytical tools that do not require fast internet connectivity, which may be unavailable in remote or developing regions. More fundamentally, the intended scope of published analytical tools and workflows is often unclear, and given the diverse applications of viral sequencing, it can be difficult to gauge the relevance of newly published tools without first testing them. For example, a fast sequence classifier might fail entirely to detect a novel strain of a well-characterized virus, or might perform well with Illumina sequences yet deliver poor results for data generated with the Ion Torrent platform. Furthermore, results arising from these analyses should be replicable, intelligible, and useful to the end user, with provision for quality control and error management. Software tools that target expert users should be tested, documented, and robustly distributed as packages or containers so as to streamline installation and the generation of results.
Methodologically, most genomic sequence analysis software is poorly suited to viral genomes, and generic tools able to address the challenges posed by viral sequences are often applicable only in limited circumstances. Choosing between approaches is made difficult by an abundance of disparate yet functionally equivalent methodologies and a general lack of rigorous benchmarks for viral datasets. While there is much ongoing research in this area, both the sensitive detection of previously characterized viruses and viral discovery remain key challenges open for innovation. Here we survey the landscape of available approaches for analyzing both known and unknown viruses within genomic and metagenomic samples, with a focus on their practical and methodological suitability for use by the broad spectrum of researchers seeking to characterize viral metagenomes.
2. Viral sequence enrichment: physical and in silico approaches
Within metagenomes, the proportion of viral nucleic acids is typically far lower than that of the host or other microbes, limiting the amount of signal available for analysis after sequencing. To mitigate this issue, enrichment and amplification approaches are widely used prior to sequencing viral samples. Size filtration and density-based enrichment by centrifugation are two effective methods for increasing virus yield, although such methods may bias the observed composition of viral populations (Ruby, Bellare, and Derisi 2013). Alternatively, PCR amplification may be used to generate an abundance of specific viral sequences present in a sample, a widely used strategy employed in the identification and analysis of the MERS coronavirus (Zaki et al. 2012; Cotten et al. 2013, 2014), although effective primer design can be challenging when genomic diversity in the target viral species is high. Conversely, an excess of sequencing coverage can lead to the construction of overly complex and unwieldy de novo assembly graphs in the presence of high genomic diversity, reducing assembly quality. Using in silico normalisation (Crusoe et al. 2015), excess coverage may be reduced by discarding sequences containing redundant information. This approach increases analytical efficiency when dealing with high-coverage sequence data, and we have shown that it can benefit de novo assembly of viral consensus sequences. Another in silico strategy for increasing analytical efficiency by discarding unneeded data is to filter out sequences from known abundant organisms through alignment against one or more reference genomes using an aligner or specialist tool (approaches reviewed in Daly et al. 2015).
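The principle of in silico normalisation can be sketched in a few lines of Python. The sketch below is illustrative only: function names and parameters are our own, and real implementations such as khmer use memory-efficient probabilistic k-mer counting rather than an exact table. A read is discarded when the median abundance of its k-mers, among the reads already kept, has reached a coverage cutoff.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def normalize(reads, k=8, cutoff=3):
    """Digital normalisation sketch: keep a read only while the median
    count of its k-mers among previously kept reads is below the
    coverage cutoff; counts are updated only for retained reads."""
    counts = Counter()
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue
        median = sorted(counts[km] for km in kms)[len(kms) // 2]
        if median < cutoff:
            kept.append(read)
            counts.update(kms)
    return kept
```

Highly redundant reads are dropped once their k-mers have been observed at sufficient depth, while rare reads carrying novel k-mers are always retained, which is why the approach preserves low-abundance viral signal while shrinking the dataset.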
3. Choosing a sequencing platform
There are several sequencing technologies in widespread use that are capable of reading hundreds of thousands to billions of DNA sequences per run (Reuter, Spacek, and Snyder 2015). The current market leader, Illumina, manufactures instruments capable of generating billions of 150 base pair (bp) paired-end reads (see ‘Glossary’) per run, with read lengths of up to 300 bp. The Illumina short-read platform is widely used for analyses of viral genomes and metagenomes and, given sufficient sequencing coverage, enables sensitive characterization of low-frequency variation within viral populations (e.g. HIV resistance mutations at frequencies as low as 0.1% (Li et al. 2014)). Ion Torrent (ThermoFisher) instruments can generate longer reads than Illumina at the expense of reduced throughput and a higher rate of insertion and deletion (indel) errors. Single molecule real-time sequencing, commercialized by Pacific Biosciences (PacBio), produces much longer (>10 kbp) reads from a single molecule without clonal amplification, eliminating the errors introduced in this step (Eid et al. 2009). However, this platform has a high (∼10%) intrinsic error rate and remains much more expensive than Illumina sequencing for equivalent throughput. The Nanopore platform from Oxford Nanopore Technologies, which includes the pocket-sized MinION sequencer, also implements long-read single molecule sequencing and permits truly real-time analysis of individual sequences as they are generated. Although more affordable than PacBio single molecule sequencing, the Nanopore platform also suffers from high error rates in comparison with Illumina (Reuter, Spacek, and Snyder 2015). However, the technology is maturing rapidly and has already demonstrated its potential to revolutionize pathogen surveillance and discovery in the field, as well as enabling contiguous assembly of entire bacterial genomes at relatively low cost (Feng et al. 2015; Quick et al. 2015; Hoenen et al. 2016).
Hybrid sequencing strategies using both long and short reads leverage the ability of long reads to resolve repetitive DNA regions while benefitting from the high accuracy of short reads, at the expense of additional sequencing, library preparation and data analysis (Madoui et al. 2015).
4. Assembling genomes: de novo and reference-based assembly
The reconstruction of sequencing reads into full-length genes and genomes can be performed by means of either reference-based alignment or de novo assembly, a decision dependent on experimental objectives, read length, read quality, and data complexity. In reference-based approaches, reads are mapped to similar regions of a supplied template genome, a well-studied and computationally efficient process typically implemented with a suffix array index of the reference genome. In contrast, de novo assembly is computationally demanding but important where a target genome is poorly characterized or where reconstruction of the genomes of a priori unknown entities in metagenomes is sought, such as in surveillance studies. For short-read data, the increased sequence length afforded by assembly can be necessary to distinguish members of highly conserved gene families from one another. Assembly is also widely used to generate whole-genome consensus sequences that facilitate analyses of viral variation, and is a typical starting point for analyses of diverse populations of well-characterized viruses. Even where long reads are available, assembly plays an important role in mitigating the high error rates associated with single molecule sequencing technologies, yielding accurate consensus sequences from inaccurate individual reads.
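The reference-based side of this trade-off can be illustrated with a minimal toy mapper. The sketch below substitutes a plain hash table of reference k-mer positions for the suffix array index used by production aligners, seeds each read with its first k-mer, and verifies only exact full-length matches; all names and parameters are illustrative, and real aligners additionally handle mismatches, indels, and reverse complements.

```python
def build_index(ref, k=11):
    """Map every k-mer in the reference to its start positions
    (a stand-in for the compressed suffix array of real aligners)."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index

def map_read(read, ref, index, k=11):
    """Seed with the read's first k-mer, then verify the whole read
    against the reference at each candidate position; return the
    first exact-match position, or None if the read does not map."""
    for pos in index.get(read[:k], []):
        if ref[pos:pos + len(read)] == read:
            return pos
    return None
```

Because the index is built once and each read lookup is near constant time, mapping scales to very large read sets, which is precisely why reference-based approaches are so much cheaper than de novo assembly when a suitable template genome exists.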
4.1 De novo assembly methodologies
Modern de novo assemblers generally leverage either de Bruijn graphs or read overlap graphs, the latter forming the basis of the approach known as overlap-layout-consensus (OLC). Figure 1 illustrates the differences between the two methods. OLC assemblers use the similarity of whole reads to construct a graph in which each read is represented by a node, and subsequently merge overlapping reads into consensus contigs (Deng et al. 2015). OLC is relatively time- and memory-intensive, scaling poorly to millions of reads and beyond. However, the fewer, longer reads generated by emerging single molecule sequencing technologies tend to be well suited to OLC assembly, which can easily be implemented to tolerate long and noisy sequences (Compeau, Pevzner, and Tesler 2011). Older, notable de novo assemblers implementing OLC include CAP3 (Huang and Madan 1999) and Celera (http://www.jcvi.org/cms/research/projects/cabog/overview/), while MHAP (Berlin et al. 2015), Canu (Berlin et al. 2015), and Miniasm (Li 2016) represent the current state of the art. A number of OLC assemblers are intended specifically for viral sequences: VICUNA was designed for short, non-repetitive, and highly variable reads from a single population (Yang et al. 2012), and PRICE (Ruby, Bellare, and Derisi 2013) iteratively assembles low- to moderate-complexity metagenomes (e.g. Runckel et al. 2011; Grard et al. 2012) using a similar algorithm to the actively developed consensus assembler IVA (Hunt et al. 2015), which, like VICUNA, is designed for single virus populations rather than metagenomes (see Table 1 for additional details on programs).
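The overlap and layout steps of OLC can be sketched with a deliberately naive greedy merge. This is illustrative only: production OLC assemblers build an explicit overlap graph and resolve repeats, errors, and ambiguous layouts rather than merging greedily, and the function names below are our own.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    ignoring overlaps shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_merge(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap,
    a greedy stand-in for the layout and consensus steps of OLC."""
    reads = list(reads)
    while len(reads) > 1:
        n, a, b = max(
            ((overlap(x, y, min_len), x, y)
             for x in reads for y in reads if x != y),
            key=lambda t: t[0],
        )
        if n == 0:
            break  # no remaining overlaps: leave separate contigs
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])  # merge, keeping the overlap once
    return reads
```

The all-against-all overlap computation is quadratic in the number of reads, which makes concrete why OLC scales poorly to millions of short reads but remains attractive for the fewer, longer reads of single molecule platforms.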
Figure 1. Two widely used methodologies in de novo assembly of short reads. Reads are not represented explicitly within a de Bruijn graph; they are instead decomposed into distinct subsequence ‘words’ of length k, or k-mers, which are linked together via overlapping k-mers to create an assembly graph. In OLC, a pairwise comparison of all reads is performed to identify reads with overlapping regions, and these overlaps are used to construct a read graph. Next, overlapping reads are bundled into aligned contigs in what is referred to as the layout step, before the most likely nucleotide at each position is finally determined by consensus. This figure is simplified to demonstrate the theory for the assembly of single genomes; the process has additional complexities for the reconstruction of metagenomes.
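The de Bruijn decomposition described in the legend can be sketched as follows. This toy ignores sequencing errors, reverse complements, and k-mer multiplicity (coverage), all of which real assemblers must handle, and the function names are our own: reads are broken into k-mers, nodes are (k-1)-mers, and contigs are spelled out by following unambiguous edges.

```python
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Decompose reads into k-mers; nodes are (k-1)-mers and each
    k-mer contributes an edge from its prefix to its suffix. Using a
    set discards edge multiplicity (coverage), which real assemblers
    retain."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            graph[km[:-1]].add(km[1:])
    return graph

def walk(graph, start):
    """Follow unambiguous edges from a start node, appending one
    nucleotide per step to spell a contig; branching or revisited
    nodes end the walk (the points real assemblers must resolve)."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        node = next(iter(graph[node]))
        if node in seen:
            break  # cycle: stop rather than loop forever
        seen.add(node)
        contig += node[-1]
    return contig
```

Note that reads never need to be compared against each other: graph construction is linear in the total sequence length, which is why de Bruijn assemblers handle the enormous read counts of short-read platforms so much better than OLC.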