We deploy optimized implementations of a selected suite of tools for long-read NGS data processing to the cloud. You can access these tools individually through our platform, but we also build customized workflows.
The suite includes tools for consensus calling and polishing, among them a cloud HPC implementation of the Circular Consensus Sequencing (CCS) tool to produce Highly Accurate Single-Molecule Consensus Reads (HiFi Reads). A BAM file of raw subreads of ~400 Gb can be processed in 4 hours, and one of ~600 Gb in 5 hours.
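To put the two processing times quoted above on a common scale, the short Python sketch below converts them to an approximate throughput. It assumes both sizes are in the same unit as quoted and that the times are wall-clock hours; the function name `throughput` is only illustrative.

```python
def throughput(size_gb: float, hours: float) -> float:
    """Return processing throughput in Gb per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return size_gb / hours


# Approximate figures quoted in the text: (input size in Gb, hours).
runs = [(400, 4), (600, 5)]
rates = [throughput(size, hours) for size, hours in runs]

for (size, hours), rate in zip(runs, rates):
    print(f"~{size} Gb in {hours} h -> ~{rate:.0f} Gb/h")
```

On these figures the larger input is processed at a somewhat higher rate per hour than the smaller one.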
Genome mapping and assembly
We are currently running minimap2 and NGMLR for mapping, and Canu and RaGOO for genome assembly. These tools have been deployed in a cluster configuration of EC2 instances that minimizes computing cost and processing time.
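As an illustration of what a single mapping step with minimap2 can look like, the Python sketch below builds a command line for aligning long reads to a reference. It is not our production configuration: the file names, the thread count, and the helper `build_minimap2_command` are hypothetical, and the preset would be chosen to match the read type (for example `map-hifi` for HiFi reads in recent minimap2 releases, or `map-pb` for raw PacBio reads).

```python
import shlex


def build_minimap2_command(reference, reads, preset="map-hifi", threads=8):
    """Build a minimap2 command that writes SAM alignments to stdout.

    -a selects SAM output, -x selects a preset, -t sets the thread count.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return ["minimap2", "-a", "-x", preset, "-t", str(threads), reference, reads]


# Hypothetical input files, used only for illustration.
cmd = build_minimap2_command("reference.fa", "hifi_reads.fastq.gz")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` with stdout redirected to a file on a machine where minimap2 is installed.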
A number of tools are available for calling single-nucleotide variants, small indels, and large structural variants. Currently we are running Assemblytics, Sniffles, and a suite of our own haplotype-calling tools built around our own algorithms, which can produce accuracy equivalent to Sanger sequencing from either CCS reads or raw uncorrected subreads. There are situations in which one or the other input dataset would be more convenient.
Metagenomics is challenging because multiple genetic variants co-exist in the same sample. PacBio's HiFi approach helps considerably with deconvoluting the genetic composition of these mixtures, but challenges remain in detecting low-frequency variants and reconstructing long haplotypes. Long-read metagenomics can be of great use for identifying new biosynthetic gene clusters, and we are actively working on new solutions.
The sequencing and data analysis costs of long-read sequencing become virtually prohibitive at large scale. We have developed new approaches for producing highly accurate variant calls from low-coverage data. We are piloting these approaches on a project seeking to uncover hidden correlates of disease in patients with different clinical conditions.