Description of the methods ========================== .. image:: viral-ngs-overview.png Taxonomic read filtration ------------------------- Human, contaminant, and duplicate read removal ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The assembly pipeline begins by depleting paired-end reads from each sample of human and other contaminants using BMTAGGER_ and BLASTN_, and removing PCR duplicates using M-Vicuna (a custom version of Vicuna_). .. _BMTAGGER: http://ftp.ncbi.nih.gov/pub/agarwala/bmtagger/screening.pdf .. _BLASTN: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch .. _Vicuna: http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/vicuna Taxonomic selection ~~~~~~~~~~~~~~~~~~~ Reads are then filtered to to a genus-level database using LASTAL_, quality-trimmed with Trimmomatic_, and further deduplicated with PRINSEQ_. .. _LASTAL: http://last.cbrc.jp .. _Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic .. _PRINSEQ: http://prinseq.sourceforge.net Viral genome analysis --------------------- Viral genome assembly ~~~~~~~~~~~~~~~~~~~~~ The filtered and trimmed reads are subsampled to at most 100,000 pairs. *de novo* assemby is performed using SPAdes_. Reference-assisted assembly improvements follow (contig scaffolding, orienting, etc.) with MUMMER_ and MUSCLE_ or MAFFT_. Gap2Seq_ is used to seal gaps between scaffolded *de novo* contigs with sequencing reads. Each sample's reads are aligned to its *de novo* assembly using Novoalign_ and any remaining duplicates were removed using Picard_ MarkDuplicates. Variant positions in each assembly were identified using GATK_ IndelRealigner and UnifiedGenotyper on the read alignments. The assembly was refined to represent the major allele at each variant site, and any positions supported by fewer than three reads were changed to N. This align-call-refine cycle is iterated twice, to minimize reference bias in the assembly. .. _SPAdes: http://bioinf.spbau.ru/en/spades .. _MUMMER: https://mummer4.github.io/ .. _MUSCLE: https://www.drive5.com/muscle/ .. _MAFFT: http://mafft.cbrc.jp/alignment/software/ .. _Gap2Seq: https://www.cs.helsinki.fi/u/lmsalmel/Gap2Seq/ .. _Novoalign: http://www.novocraft.com/products/novoalign/ .. _Picard: http://broadinstitute.github.io/picard .. _GATK: https://www.broadinstitute.org/gatk/ Intrahost variant identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Intrahost variants (iSNVs) were called from each sample's read alignments using `V-Phaser2 `_ and subjected to an initial set of filters: variant calls with fewer than five forward or reverse reads or more than a 10-fold strand bias were eliminated. iSNVs were also removed if there was more than a five-fold difference between the strand bias of the variant call and the strand bias of the reference call. Variant calls that passed these filters were additionally subjected to a 0.5% frequency filter. The final list of iSNVs contains only variant calls that passed all filters in two separate library preparations. These files infer 100% allele frequencies for all samples at an iSNV position where there was no intra-host variation within the sample, but a clear consensus call during assembly. Annotations are computed with snpEff_. .. _snpEff: http://snpeff.sourceforge.net/ Taxonomic read identification ----------------------------- Metagenomic classifiers include Kraken_ and Diamond_. In each case, results are visualized with Krona_. .. _Kraken: https://ccb.jhu.edu/software/kraken/ .. _Diamond: https://ab.inf.uni-tuebingen.de/software/diamond .. _Krona: https://github.com/marbl/Krona/wiki