BONSAI - 2017 - Annual activity report

Section: New Results

Large-scale sequencing data indexing

Petabytes of DNA and RNA sequencing data are currently stored in online databases. These databases can be accessed in two ways: 1) metadata queries (organism, instrument type, etc.), and 2) downloads of raw data. Because of the sheer size of the data, the web servers do not offer a way to search for sequences inside datasets. Such an operation would be invaluable to biology investigators, for example to determine which experiments contain an organism of interest, high expression of a certain transcript, or a certain mutation. Prior work exists for indexing sequencing data (Bloom Filter Tries, Sequence Bloom Trees), yet its performance remains prohibitive (either high memory usage or several days to perform certain queries).

We proposed a new formalism, the Allsome Sequence Bloom Trees [28]. It improves upon Sequence Bloom Trees in construction time (by 50%) and query time (by 40-85%), and it also permits dataset-vs-dataset searches. The method was tested by indexing a subset of 2,652 human RNA-seq experiments from the Sequence Read Archive. Allsome Sequence Bloom Trees pave the way towards "Google"-style searches of petabytes of sequencing data.
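
To give a feel for how a tree-shaped index can answer "which experiments contain this sequence?" without visiting every dataset, here is a minimal toy sketch. It rests on several assumptions that are not taken from [28]: plain Python sets stand in for Bloom filters, each node keeps the k-mers found in all experiments below it (`in_all`) and those found in only some of them (`in_some`), and a query counts as a hit when at least a fraction `theta` of its k-mers is present. All names (`Node`, `merge`, `query`, the `exp_*` labels) are hypothetical, and the published method includes refinements that are not reproduced here.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


@dataclass
class Node:
    in_all: Set[str]    # k-mers present in every experiment below this node
    in_some: Set[str]   # k-mers present in some, but not all, of them
    name: Optional[str] = None              # set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf(name: str, seq: str, k: int) -> Node:
    """A leaf holds one experiment, so every k-mer is trivially in 'all'."""
    return Node(in_all=kmers(seq, k), in_some=set(), name=name)


def merge(left: Node, right: Node) -> Node:
    """Build a parent from two children by splitting the union into all/some."""
    union = left.in_all | left.in_some | right.in_all | right.in_some
    shared = left.in_all & right.in_all
    return Node(in_all=shared, in_some=union - shared, children=[left, right])


def leaves(node: Node) -> List[str]:
    """List the experiment names below a node."""
    if not node.children:
        return [node.name]
    return [n for child in node.children for n in leaves(child)]


def query(node: Node, q: Set[str], needed: int) -> List[str]:
    """Return experiments containing at least `needed` of the query k-mers."""
    sure = q & node.in_all     # found in every experiment below
    maybe = q & node.in_some   # found in at least one experiment below
    if len(sure) >= needed:
        return leaves(node)    # whole subtree matches, no need to descend
    if len(sure) + len(maybe) < needed:
        return []              # no experiment below can reach the threshold
    return [n for child in node.children for n in query(child, q, needed)]


# Toy usage with three made-up "experiments".
K = 4
root = merge(
    merge(leaf("exp_A", "ACGTACGTGG", K), leaf("exp_B", "ACGTACGTCC", K)),
    leaf("exp_C", "TTTTGGGGCC", K),
)
q = kmers("ACGTACG", K)
theta = 0.8
hits = query(root, q, math.ceil(theta * len(q)))
print(hits)
```

In this sketch the `in_all` set lets a query accept an entire subtree at once, while the combined count of `in_all` and `in_some` lets it discard a subtree early. A real index would replace the sets with fixed-size Bloom filters, which introduces false positives that the toy version does not model.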
