Section: New Results


Participants: David James Sherman [correspondent], Pascal Durrens, Tiphaine Martin, Natalia Golenetskaya, Florian Lajus.

The Tsvetok project in Magnome targets “scaling out” of data and computation, both to increase capacity for handling large volumes of data and to permit more automated analysis in projects of the “comparative genomics of related species” type, where a set of genomes is sequenced and analyzed as part of the same process. Natalia Golenetskaya has designed and implemented a NoSQL schema in three steps: identifying the standard queries, defining a query-oriented storage schema suited to them, and mapping structured values to this schema. This prototype is being tested on an Apache Cassandra ring deployed in Magnome's dedicated computing cluster.
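The idea of a query-oriented storage schema can be illustrated with a minimal Python sketch. It is not the Tsvetok schema: the record fields, the table names (genes_by_genome, genes_by_family) and the helper functions are hypothetical, and plain dictionaries stand in for Cassandra partitions. The sketch only shows the general pattern of storing each structured value once per standard query, under the key that query will use.

```python
from collections import defaultdict

# Hypothetical structured values; field names are illustrative only.
GENES = [
    {"gene_id": "g1", "genome": "speciesA", "family": "F1"},
    {"gene_id": "g2", "genome": "speciesA", "family": "F2"},
    {"gene_id": "g3", "genome": "speciesB", "family": "F1"},
]

# One table per standard query: table name -> field used as partition key.
# These two queries are assumptions made for the example.
QUERY_TABLES = {
    "genes_by_genome": "genome",
    "genes_by_family": "family",
}


def build_tables(records, query_tables):
    """Denormalize records into one partition-keyed table per query."""
    tables = {name: defaultdict(list) for name in query_tables}
    for record in records:
        for name, key_field in query_tables.items():
            # The same record is written once per query table.
            tables[name][record[key_field]].append(record)
    return tables


def lookup(tables, table_name, key):
    """Answer a standard query by reading a single partition."""
    return [r["gene_id"] for r in tables[table_name].get(key, [])]


TABLES = build_tables(GENES, QUERY_TABLES)
```

With this layout, a call such as lookup(TABLES, "genes_by_family", "F1") reads one partition and needs no join. The cost is duplicated storage, which is the usual trade-off of designing the schema from the queries.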

Large-scale data mining of the kind required for comparative genomics is fundamentally data-parallel: an initial transformation is applied to every data object of a given type (such as genes or even individual nucleotides), then a statistical machine learning procedure is applied to the transformed data to produce a summary or to learn a classification function. Analyses of this kind are the design goal of the MapReduce paradigm [31]. Using Tsvetok as a generator for Apache Hadoop, Natalia Golenetskaya is designing MapReduce solutions for the principal whole-genome and data-mining analyses used by Magnome for eukaryote and prokaryote comparative genomics.
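The data-parallel pattern described above can be sketched in a few lines of single-process Python. This is an illustration of the MapReduce structure only, not one of the Magnome analyses and not Hadoop code: the per-genome GC-content summary, the record fields and the function names are all assumptions chosen for the example, and every genome is assumed to contain at least one nucleotide.

```python
from collections import defaultdict


def map_gene(gene):
    """Map step: transform one gene into (genome, (GC count, length))."""
    seq = gene["sequence"].upper()
    gc = sum(1 for base in seq if base in "GC")
    yield gene["genome"], (gc, len(seq))


def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_gc(genome, values):
    """Reduce step: summarize one genome as its overall GC fraction."""
    total_gc = sum(gc for gc, _ in values)
    total_len = sum(length for _, length in values)
    return genome, total_gc / total_len


def run(genes):
    """Apply map to every gene, then reduce each group independently."""
    pairs = (pair for gene in genes for pair in map_gene(gene))
    return dict(reduce_gc(k, v) for k, v in shuffle(pairs).items())
```

Because map_gene looks at one gene at a time and reduce_gc looks at one genome at a time, both steps can in principle be distributed across machines without changing their logic.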