MAGNOME - 2012 - Annual activity report

Section: New Results

Big Data in comparative genomics

Participants: David James Sherman [correspondent], Pascal Durrens, Natalia Golenetskaya, Florian Lajus, Tiphaine Martin.

Data growth in comparative genomics presents a significant scaling challenge that requires novel computational methods. The increase in sequence data is already a challenge in itself; in addition, the number of relations between biological objects grows supralinearly (geometrically in the worst case) for every linear increase in sequence data.
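As a rough illustration of this supralinear growth, the sketch below counts all-against-all pairwise relations between genes. The choice of pairwise relations and the function name `pairwise_relations` are assumptions made for the example; the report does not specify which relations are meant, and pairwise comparison gives quadratic growth rather than the worst case.

```python
def pairwise_relations(n_genes: int) -> int:
    """Number of unordered pairs among n_genes objects (all-against-all)."""
    return n_genes * (n_genes - 1) // 2


# Doubling the number of genes roughly quadruples the number of relations.
sizes = (1_000, 2_000, 4_000)
counts = {n: pairwise_relations(n) for n in sizes}
for n in sizes:
    print(f"{n} genes -> {counts[n]} pairwise relations")
```

Under this assumption, each doubling of the input multiplies the number of relations to store and analyze by about four.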

Magnome's Tsvetok system proposes a highly scalable distributed approach to data and computation in comparative genomics, targeting projects of the “comparative genomics of related species” type, where a set of genomes is sequenced and analyzed as part of the same process. Tsvetok combines a novel NoSQL storage schema with domain-specific MapReduce algorithms to efficiently handle the fundamentally data-parallel analyses encountered in comparative genomics. Natalia Golenetskaya, with Florian Lajus, derived use cases from web site log analyses in order to identify standard queries, define an appropriate query-oriented storage schema, and map structured values to this schema. The approach was tested on Magnome's dedicated computing cluster.
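The idea of a query-oriented schema can be sketched as follows: rather than normalizing the data and joining at query time, one table is built per standard query, with a row key chosen so that the query becomes a single key lookup. This is a minimal in-memory sketch and not the Tsvetok schema itself; the record fields, the example query, and the helper `build_query_table` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical structured values; the field names are assumptions for this sketch.
genes = [
    {"gene": "g1", "genome": "A", "family": "F1"},
    {"gene": "g2", "genome": "A", "family": "F1"},
    {"gene": "g3", "genome": "B", "family": "F1"},
    {"gene": "g4", "genome": "B", "family": "F2"},
]


def build_query_table(records, key_fields, value_field):
    """Denormalize records into a key-value table keyed for one standard query."""
    table = defaultdict(list)
    for rec in records:
        # The row key concatenates the fields that the query filters on.
        row_key = ":".join(rec[f] for f in key_fields)
        table[row_key].append(rec[value_field])
    return dict(table)


# Assumed standard query: "which genes of family F belong to genome G?"
by_family_genome = build_query_table(genes, ("family", "genome"), "gene")
print(by_family_genome["F1:A"])
```

The design trade-off is the usual one for this style of storage: data is duplicated across per-query tables in exchange for reads that need no joins.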

Natalia Golenetskaya furthermore defined new distributed algorithms for two important large-scale analyses in Magnome's pipeline: systematic identification of gene fusion and fission events in eukaryote genomes (following [7]), and large-scale consensus clustering for protein families (following [9]). For fusions and fissions, she defined a new MapReduce algorithm that avoids graph-based analysis (which is notoriously slow in MapReduce), achieving both significant speedups and excellent scaling to much larger data sets. For protein family clustering, she defined a novel iterative sampling strategy that combines parallel clustering of submatrices of pairwise relations to successively approximate the result of a complete clustering, without the need to store the entire matrix of relations in memory.
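To give a flavor of how fusion candidates can be found without building a graph, the sketch below uses a simple key-grouping formulation: the map step keys each alignment hit by its query gene, and the reduce step reports pairs of distinct target genes that match non-overlapping regions of the same query. This is not the algorithm defined in the report, whose details are not given here; it is a generic illustration, run in memory rather than on a cluster, and the hit format and the function names `map_hit`, `reduce_query`, and `run` are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical alignment hits: (query_gene, target_gene, query_start, query_end).
hits = [
    ("AB", "a", 1, 300),
    ("AB", "b", 320, 600),
    ("C", "c1", 1, 200),
    ("C", "c2", 50, 210),
]


def map_hit(hit):
    """Map step: key each hit by its query gene."""
    query, target, start, end = hit
    yield query, (target, start, end)


def reduce_query(query, values):
    """Reduce step: emit target pairs covering disjoint regions of the query."""
    ordered = sorted(values, key=lambda v: v[1])  # sort by start position
    for (t1, _s1, e1), (t2, s2, _e2) in combinations(ordered, 2):
        if t1 != t2 and e1 < s2:
            yield query, t1, t2


def run(all_hits):
    """Simulate the shuffle phase in memory, then apply the reducer per key."""
    groups = defaultdict(list)
    for hit in all_hits:
        for key, value in map_hit(hit):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_query(key, groups[key]))
    return results


candidates = run(hits)
print(candidates)
```

Because each reducer sees only the hits for one query gene, the work is data-parallel across genes and requires no traversal of a global similarity graph.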
