AMIB - 2011 - Annual activity report

Section: New Results

Combinatorics and Annotation

Word counting and random generation

Cis-regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors, grouped in clusters. Formally, such sites can be viewed as words co-occurring in the DNA sequence. This raises the problem of computing the statistical significance of the event that multiple sites, recognized by different factors, are found simultaneously in a text of fixed length. The team has a long-standing research effort on word enumeration. An extension to Hidden Markov Models was recently developed in collaboration with M. Roytberg (Impb, Puschino, Russia). It relies on a new concept of overlap graphs that efficiently overcomes the main difficulty in probability computations, namely overlapping occurrences. This work is part of E. Furletova's thesis, to be defended soon. An implementation is available at http://server2.lpm.org.ru/bio. This algorithm provides a significant space improvement over a previous algorithm, AhoPro, developed with our former associate team Migec. M. Régnier and S. Sheikh have addressed combinatorial problems on clumps that should allow a further decrease in space; large deviation results were presented at Mccmb'11.
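To illustrate why overlapping occurrences complicate probability computations, the following minimal sketch computes the probability that a single word occurs at least once in a random text of fixed length. It assumes an independent (Bernoulli) letter model and uses a textbook pattern-matching automaton; it is not the overlap-graph algorithm or AhoPro, and the function names are hypothetical.

```python
def matching_automaton(word, alphabet):
    """delta[state][letter] = length of the longest prefix of `word`
    that is a suffix of the text read so far (naive construction)."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for letter in alphabet:
            text = word[:state] + letter
            k = min(m, len(text))
            while k > 0 and word[:k] != text[-k:]:
                k -= 1
            delta[state][letter] = k
    return delta


def prob_at_least_one(word, n, probs):
    """Probability that `word` occurs at least once in a random text of
    length n whose letters are drawn independently according to `probs`."""
    m = len(word)
    delta = matching_automaton(word, list(probs))
    dist = [0.0] * m          # distribution over the non-accepting states
    dist[0] = 1.0
    hit = 0.0                 # mass already absorbed by the accepting state
    for _ in range(n):
        new = [0.0] * m
        for state, p in enumerate(dist):
            if p == 0.0:
                continue
            for letter, p_letter in probs.items():
                target = delta[state][letter]
                if target == m:
                    hit += p * p_letter
                else:
                    new[target] += p * p_letter
        dist = new
    return hit


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# "AA" overlaps itself, "AC" does not: same length, different probabilities.
print(prob_at_least_one("AA", 3, uniform), prob_at_least_one("AC", 3, uniform))
```

Two words of the same length and same letter probabilities get different results because self-overlapping words tend to occur in clumps, which is the effect that dedicated structures such as overlap graphs are designed to handle efficiently.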

Another application of word combinatorics was started this year. During his internship, L. Pei (Paris-Sud 11 U.) built a pipeline that simulates the random generation of reads and assembles them using the Mira software. This work will be continued by D. Iakovishina in her thesis. It is a collaboration with Magnome at Inria-Bordeaux and IOGene in Moscow.
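As a rough illustration of the read-generation step, the sketch below draws error-free reads at uniformly random positions of a reference sequence. It is only an assumed, simplified model: the actual pipeline, its error model, and the call to Mira are not described here, and the names `random_genome` and `simulate_reads` are hypothetical.

```python
import random


def random_genome(length, seed=0):
    """Uniform random DNA sequence, used here as a stand-in reference."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def simulate_reads(genome, read_length, coverage, seed=0):
    """Draw error-free reads at uniform random start positions so that the
    total number of sequenced bases is about coverage * len(genome)."""
    if read_length > len(genome):
        raise ValueError("read_length must not exceed the genome length")
    rng = random.Random(seed)
    n_reads = (coverage * len(genome)) // read_length
    last_start = len(genome) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)   # both bounds are inclusive
        reads.append(genome[start:start + read_length])
    return reads


genome = random_genome(1000, seed=1)
reads = simulate_reads(genome, read_length=50, coverage=10)
print(len(reads), reads[0])
```

The resulting reads could then be written to a FASTA or FASTQ file and passed to an assembler.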

A previous work [36], published in 2010, generalized Boltzmann samplers to multivariate objects, allowing for efficient random generation with a fixed or approximate composition for context-free languages. However, the performance of such algorithms was only guaranteed in the case of strongly-connected context-free grammars. In a recent collaboration with O. Bodini, H. Tafat and C. Banderier (LIPN, Paris-XIII), we are working on characterizing the distributions arising from simply connected grammars. In a short paper accepted for presentation at the Analco'12 conference [24], we showed that: i) a large class of distributions can be reached for the number of occurrences of a single letter, arguably the simplest observable pattern; ii) simple grammars/regular expressions can be built that realize these distributions; iii) classic Boltzmann samplers remain largely unaffected by this diversity.
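For readers unfamiliar with Boltzmann sampling, here is a minimal univariate sketch for binary trees, specified by B = 1 + Z B^2 and counted by internal nodes. It is a textbook example under stated assumptions, not the multivariate samplers of [36] or [24]; the parameter `x` and all function names are illustrative.

```python
import math
import random


def binary_tree_gf(x):
    """Generating function B(x) solving B = 1 + x * B^2, for 0 < x < 1/4."""
    if not 0.0 < x < 0.25:
        raise ValueError("x must lie in (0, 1/4)")
    return (1.0 - math.sqrt(1.0 - 4.0 * x)) / (2.0 * x)


def boltzmann_binary_tree(x, rng):
    """Return the preorder encoding of a random binary tree ('N' = internal
    node, 'L' = leaf). A tree with n internal nodes has probability
    x**n / B(x), so trees of equal size are equally likely."""
    p_leaf = 1.0 / binary_tree_gf(x)
    encoding = []
    pending = 1                      # subtrees that remain to be generated
    while pending > 0:
        pending -= 1
        if rng.random() < p_leaf:
            encoding.append("L")
        else:
            encoding.append("N")
            pending += 2             # an internal node has two children
    return "".join(encoding)


def sample_in_size_range(x, lo, hi, rng, max_tries=100000):
    """Approximate-size sampling by rejection on the number of internal nodes."""
    for _ in range(max_tries):
        tree = boltzmann_binary_tree(x, rng)
        if lo <= tree.count("N") <= hi:
            return tree
    raise RuntimeError("no tree of the requested size was drawn")


rng = random.Random(42)
print(sample_in_size_range(0.24, 3, 6, rng))
```

Moving `x` towards the singularity 1/4 increases the expected size. Multivariate samplers add one such parameter per letter whose number of occurrences is to be controlled.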

Our work on random generation has applications in software testing and model-checking, in a collaboration with the Fortesse group at LRI [13], [29].

RNA combinatorics

Pseudoknots are usually ignored by popular software for RNA structure prediction. This means that, even under the daring assumptions of a unique and well-defined fold for RNA, coupled with a perfectly accurate energy model, the real structure of an RNA will not be recovered perfectly. In a collaboration between Amib members and S. Janssen (Universität Bielefeld), we investigated the practical implications of this limitation. We ran RNAFold, a popular software for RNA structure prediction, on representative sequences of the Rfam database, which groups known RNA sequences into about 2000 functional families. We observed that 12% of Rfam families exhibited a total absence of overlap between predicted structures and manually curated structures derived from experimental or evolutionary data. Combining Rfam annotations, a survey of the literature, and a newly developed method for predicting the presence of functional pseudoknots, we were able to validate that a large majority of the mispredicted families featured evidence of pseudoknots in their functional conformation. Preliminary results were presented by B. Raman at the Fifth Indo-French Bioinformatics Meeting [34].
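The two notions used above, pseudoknots and overlap between structures, can be made concrete with a small sketch. It assumes that a structure is given as a list of base pairs (i, j) with 0-based positions; the helper names are hypothetical and this is not the predictive method mentioned in the text.

```python
def normalize(pairs):
    """Represent each base pair as (smaller position, larger position)."""
    return {(min(i, j), max(i, j)) for i, j in pairs}


def has_pseudoknot(pairs):
    """True if two base pairs cross, i.e. i < k < j < l for pairs (i, j), (k, l)."""
    ordered = sorted(normalize(pairs))
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return True
    return False


def shared_pairs(predicted, reference):
    """Base pairs present in both structures; an empty set means a total
    absence of overlap between prediction and reference."""
    return normalize(predicted) & normalize(reference)


nested = [(0, 9), (1, 8), (2, 7)]
h_type = [(0, 5), (1, 4), (3, 8)]
print(has_pseudoknot(nested), has_pseudoknot(h_type))
print(shared_pairs(nested, h_type))
```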

In 2004, Condon and coauthors gave a hierarchical classification of exact RNA structure prediction algorithms according to the generality of the structure classes that they handle. In [19], we completed this classification by adding two recent prediction algorithms. More importantly, we precisely quantified the hierarchy by giving closed or asymptotic formulas for the theoretical number of structures of a given size $n$ in all the classes but one. This makes it possible to assess the tradeoff between the expressiveness and the computational complexity of RNA structure prediction algorithms.
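As a baseline for such counts, the sketch below enumerates the simplest class only, namely pseudoknot-free secondary structures on $n$ bases, using the classical recurrence on the status of the last base. It assumes that any two bases may pair provided at least `min_loop` unpaired bases separate them; the pseudoknotted classes studied in [19] require richer decompositions that are not reproduced here.

```python
def count_secondary_structures(n, min_loop=1):
    """Number of pseudoknot-free secondary structures on n bases.

    The last base is either unpaired, or paired with base i, which splits
    the structure into i - 1 bases on the left and m - i - 1 bases enclosed
    by the pair (at least min_loop of them).
    """
    s = [1] * (n + 1)
    for m in range(1, n + 1):
        total = s[m - 1]                       # last base unpaired
        for i in range(1, m - min_loop):       # last base paired with base i
            total += s[i - 1] * s[m - i - 1]
        s[m] = total
    return s[n]


print([count_secondary_structures(n) for n in range(9)])
# → [1, 1, 1, 2, 4, 8, 17, 37, 82]
```

The exponential growth of these numbers, compared with the growth obtained for richer classes, gives a quantitative handle on expressiveness.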

Similar decompositions can be used for the design of algorithms that include tractable subclasses of pseudoknots. In [30], Y. Ponty and C. Saule extended a unifying framework introduced by Roytberg and Finkelstein to design ensemble RNA algorithms. This framework uses a family of hypergraphs to describe the conformation space, allowing for a clear separation between the search space, i.e. the set of admissible conformations, and the intended application (Minimal Free-Energy folding, partition function, statistical sampling, etc.). We illustrated the promise of such an approach by explicitly rephrasing three major search spaces within the framework, and introduced an algorithm for computing the moments of any additive feature in the Boltzmann distribution.
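To give a flavour of moment computations in the Boltzmann ensemble, the following sketch computes the partition function and the expected number of base pairs (the first moment of an additive feature) on pseudoknot-free structures. It assumes a deliberately simplistic energy model in which each base pair contributes a fixed energy; the parameter values and function name are hypothetical, and this is a plain dynamic program rather than the hypergraph framework of [30].

```python
import math

# Canonical and wobble pairs; an assumption of this toy model.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def expected_base_pairs(seq, pair_energy=-1.0, rt=0.6, min_loop=3):
    """Return (Z, E[number of base pairs]) over all pseudoknot-free structures.

    Z[i][j] sums the Boltzmann weights of the structures on seq[i:j];
    D[i][j] sums weight * (number of base pairs) over the same structures.
    """
    n = len(seq)
    w = math.exp(-pair_energy / rt)            # Boltzmann weight of one pair
    Z = [[1.0] * (n + 1) for _ in range(n + 1)]
    D = [[0.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length                     # half-open interval [i, j)
            z, d = Z[i][j - 1], D[i][j - 1]    # last base unpaired
            for k in range(i, j - 1 - min_loop):
                if (seq[k], seq[j - 1]) in PAIRS:
                    z_left, d_left = Z[i][k], D[i][k]
                    z_in, d_in = Z[k + 1][j - 1], D[k + 1][j - 1]
                    z += w * z_left * z_in
                    # Product rule, plus one extra pair for (k, j - 1).
                    d += w * (d_left * z_in + z_left * d_in + z_left * z_in)
            Z[i][j], D[i][j] = z, d
    return Z[0][n], D[0][n] / Z[0][n]


print(expected_base_pairs("GGGAAAUCCC"))
```

Higher moments can be obtained along the same lines by propagating additional accumulators through the recurrence.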

By comparing empirical observations with the expected behavior of a model, combinatorial methods can be used to identify an evolutionary pressure weighing on RNA. In a collaboration with P. Clote (Boston College/Digiteo) [11], we used analytic combinatorics to study the expected distance between both ends of an RNA molecule, or $5'$-$3'$ distance. Postulating a Boltzmann distribution on all secondary structures, we showed that this parameter is bounded by a (typically small) constant value when the sequence length goes to infinity. Computing this quantity on a database of experimentally determined secondary structures, we observed that the $5'$-$3'$ distances take larger values than those predicted by the model. Furthermore, quite surprisingly, this quantity was shown to correlate positively with the length. We concluded by hypothesizing that the secondary structure of RNA may be under evolutionary pressure to fold in a modular way, creating independent domains on the exterior face.
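The following sketch shows one way to compute an end-to-end distance on a given structure. It assumes that the distance is the length of the shortest path between the first and last bases in the graph whose edges are backbone bonds and base pairs, and that the structure is given in pseudoknot-free dot-bracket notation; the exact definition used in [11] may differ, and the function names are hypothetical.

```python
from collections import deque


def parse_dot_bracket(structure):
    """Map each paired position to its partner."""
    stack, partner = [], {}
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            opening = stack.pop()
            partner[opening] = pos
            partner[pos] = opening
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def end_to_end_distance(structure):
    """Breadth-first search from the first base to the last base."""
    n = len(structure)
    if n == 0:
        return 0
    partner = parse_dot_bracket(structure)
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        if u == n - 1:
            break
        neighbours = [u - 1, u + 1]            # backbone bonds
        if u in partner:
            neighbours.append(partner[u])      # base pair
        for v in neighbours:
            if 0 <= v < n and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist[n - 1]


print(end_to_end_distance("....."), end_to_end_distance("(...)(...)"))
# → 4 3
```

In this model, each helix on the exterior face costs a single step, so a structure made of several independent domains has a larger end-to-end distance than a single enclosing helix.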

Data integration

Recent years have seen a revitalization of data integration research in the life sciences. But the perception of the problem has changed: while early approaches concentrated on handling schema-dependent queries over heterogeneous and distributed databases, current research emphasizes instances rather than schemas, tries to place the human back into the loop, and intertwines data integration and data analysis. In this domain, the contribution of Amib in 2011 has been threefold. First, we have continued our collaboration with Ulf Leser (invited to the Amib group at Lri for six months in 2010): in [28], we reviewed the past and current state of data integration for the life sciences and discussed in detail recent trends, all of which pose various challenges for the database community.

Additionally, we have worked on a vision of what workflow systems should do to make it possible to search, adapt, and reuse scientific workflows, and we have provided a complete state of the art of this domain [12].

Second, in close collaboration with oncologists from the Institut Curie and the Children's Hospital of Philadelphia, we have worked on the problem of ranking genes of interest associated with a given disease. The software GeneValorization has been designed and developed in this context; it provides a concise view of the available literature associated with a list of genes [10]. A second aspect of this research has been the design of a consensus ranking method, BioConsert, able to make the most of a set of established rankings, i.e. to highlight their common points [27]. This last point has been carried out in close collaboration with Sylvie Hamel, invited professor in our group for two months in 2010.

Third, we have presented a simple logical query language called RL for expressing different kinds of rules, especially well suited to expressing association rules for transcriptomic data. In that context, the challenge is to find relationships between genes that reflect observations of how the expression level of each gene affects those of others. The conjecture that association rules could be a model for the discovery of gene regulatory networks has already been partially validated. Nevertheless, several different kinds of rules between genes could be useful with respect to some biological objectives, and we have designed a framework in which biologists may define their "own customized semantics" for rules with regard to their requirements. We have studied how the RL language behaves with respect to the well-known Armstrong's axioms [22]. The main contribution of this paper is to exhibit a restricted form of RL-queries, yet with good expressive power, for which Armstrong's axioms are sound. As a consequence, this sublanguage has structural and computational properties which have been shown to be very useful in data mining, databases and formal concept analysis.
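To see why soundness of Armstrong's axioms is not automatic for a given rule semantics, consider the following sketch. It assumes the classical confidence-threshold semantics of association rules, which is only one possible semantics and is not the RL language itself; the gene names, samples, and function names are invented for illustration.

```python
def confidence(samples, antecedent, consequent):
    """Fraction of the samples containing `antecedent` that also contain
    `consequent`. Each sample is the set of genes over-expressed in it."""
    covering = [s for s in samples if antecedent <= s]
    if not covering:
        return 0.0
    return sum(1 for s in covering if consequent <= s) / len(covering)


def rule_holds(samples, antecedent, consequent, min_conf):
    """A rule holds if its confidence reaches the threshold."""
    return confidence(samples, antecedent, consequent) >= min_conf


# Hypothetical toy data: six samples over three genes.
samples = [
    {"g1", "g2"}, {"g1", "g2"}, {"g1"},
    {"g2", "g3"}, {"g2", "g3"}, {"g2", "g3"},
]

print(rule_holds(samples, {"g1"}, {"g2"}, 0.5))  # g1 -> g2
print(rule_holds(samples, {"g2"}, {"g3"}, 0.5))  # g2 -> g3
print(rule_holds(samples, {"g1"}, {"g3"}, 0.5))  # g1 -> g3
```

On this data the first two rules hold while the third does not, so transitivity, one of Armstrong's axioms, fails under a confidence threshold. This is the kind of behavior that a restricted sublanguage with sound axioms rules out.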
