Section: New Results
A spectral algorithm for fast de novo layout of uncorrected long nanopore reads
Seriation is an optimization problem that seeks to reconstruct an ordering of variables from pairwise similarity information. It can be formulated as a combinatorial problem over permutations, and several algorithms have been derived from relaxations of this problem. We link the seriation framework to the task of de novo genome assembly, which consists of reconstructing a whole DNA sequence from small pieces of it that are oversampled so as to cover the full genome. To achieve this task, one has to find the layout of these small pieces of DNA sequence (reads), that is, their relative positions along the genome. This layout step can be cast as a seriation problem, with the overlaps between reads providing the pairwise similarities. We show that a spectral algorithm for seriation can be applied efficiently within a genome assembly scheme.
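To make the seriation idea concrete, here is a minimal sketch of a generic spectral ordering: build the graph Laplacian of the similarity matrix and sort the items by the entries of its second-smallest eigenvector (the Fiedler vector). This is an illustration under stated assumptions, not the spectrassembler implementation: the function names `spectral_order` and `banded_similarity` are hypothetical, and the toy similarity matrix is synthetic and noiseless, standing in for overlap scores between reads.

```python
import numpy as np


def spectral_order(similarity):
    """Order items by sorting the Fiedler vector of the similarity graph."""
    S = np.asarray(similarity, dtype=float)
    S = (S + S.T) / 2.0                      # enforce symmetry
    laplacian = np.diag(S.sum(axis=1)) - S   # L = D - S
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # second-smallest eigenvector
    return np.argsort(fiedler)


def banded_similarity(n, bandwidth):
    """Toy similarity: items close in the true order are more similar."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.maximum(bandwidth - dist, 0).astype(float)


# Synthetic example: 30 "reads" whose true genome positions are shuffled.
rng = np.random.default_rng(0)
n = 30
true_pos = rng.permutation(n)  # true_pos[i] = true position of read i
S = banded_similarity(n, 5)[np.ix_(true_pos, true_pos)]

order = spectral_order(S)      # reads sorted along the recovered layout
recovered = true_pos[order]    # true positions in recovered order
```

On this noiseless toy input, `recovered` should be monotone: the ordering is only determined up to reversal, since the sign of an eigenvector is arbitrary. Real overlap data are noisy and may contain repeats, so a practical pipeline would need additional handling that this sketch does not attempt.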
New long-read sequencers promise to transform sequencing and genome assembly by producing reads tens of kilobases long. However, their high error rate significantly complicates assembly and requires expensive correction steps before the reads can be laid out using standard assembly engines.
We present an original and efficient spectral algorithm to lay out uncorrected nanopore reads, and its seamless integration into a straightforward overlap/layout/consensus (OLC) assembly scheme. The method is shown to assemble Oxford Nanopore reads from several bacterial genomes into good-quality ( identity to the reference) genome-sized contigs, while yielding more fragmented assemblies from a Saccharomyces cerevisiae reference strain. The software is available at https://github.com/antrec/spectrassembler.