Section: Scientific Foundations

Annotation and Combinatorics

Word counting

Participants : Alain Denise, Daria Iakovishina, Mireille Régnier, Saad Sheikh, Jean-Marc Steyaert.

We aim to enumerate or generate sequences or structures that are admissible in the sense that they are likely to possess some given biological property. Team members share expertise in the enumeration and random generation of combinatorial structures. They have developed computational tools for probability distributions on combinatorial objects, using in particular generating functions and analytic combinatorics. Admissibility criteria can be mainly statistical; they can also rely on the optimisation of some biological parameter, such as an energy function.

The ability to distinguish a significant event from statistical noise is a crucial need in bioinformatics. In a first step, one defines a suitable probabilistic model (null model) that takes into account the relevant biological properties of the structures of interest. A second step is to develop accurate criteria for assessing whether or not they are exceptional. An event observed in biological sequences is considered exceptional, and therefore biologically significant, if the probability that it occurs is very small in the null model. Our approach to computing such a probability consists of enumerating good structures or combinatorial objects. Thirdly, it is necessary to design and implement efficient algorithms to compute these formulae or to generate random data sets. Typical examples that motivate research on word and motif counting are Transcription Factor Binding Sites (TFBSs), consensus models of recoding events and some RNA structural motifs. The project has made significant contributions to the area of word enumeration. When relevant motifs cannot be described by regular languages, one may still take advantage of combinatorial properties to define functions whose study is amenable to our algebraic tools. Examples include secondary structures and recoding events.
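As an illustration of assessing exceptionality under a null model, the following minimal sketch computes the probability that a given word occurs at least once in a random text of length n. It assumes the simplest possible null model (independent letters with fixed frequencies) and uses a standard pattern-matching automaton with dynamic programming; this is a textbook technique, not necessarily the method used by the team. The function name `occurrence_probability` and the example frequencies are hypothetical.

```python
def occurrence_probability(word, n, freqs):
    """P(word occurs at least once in an i.i.d. random text of length n).

    freqs maps each letter of the alphabet to its probability (assumed null model).
    """
    m = len(word)
    alphabet = list(freqs)

    # Knuth-Morris-Pratt failure function of the word.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and word[i] != word[k]:
            k = fail[k - 1]
        if word[i] == word[k]:
            k += 1
        fail[i] = k

    # Automaton state = length of the longest prefix of `word`
    # that is a suffix of the text read so far.
    def step(state, c):
        while state and word[state] != c:
            state = fail[state - 1]
        return state + 1 if word[state] == c else state

    delta = [{c: step(s, c) for c in alphabet} for s in range(m)]

    # dist[s] = probability of being in state s without having seen the word.
    dist = [0.0] * m
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * m
        for s, p in enumerate(dist):
            if p == 0.0:
                continue
            for c in alphabet:
                t = delta[s][c]
                if t < m:  # state m means the word was found: mass is dropped
                    new[t] += p * freqs[c]
        dist = new
    return 1.0 - sum(dist)


# Hypothetical uniform null model on the DNA alphabet.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = occurrence_probability("ACA", 100, uniform)
```

A small value of `p` for a word that is nevertheless observed would mark that word as exceptional under this null model. More realistic null models (for instance Markov chains) require extending the automaton states.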

Random generation

Participants : Alain Denise, Yann Ponty.

Analytical methods may fail when both sequential and structural constraints of sequences are to be modelled or, more generally, when molecular structures such as RNA structures have to be handled. The random generation of combinatorial objects is an alternative, yet natural, framework to assess the significance of observed phenomena. General and efficient techniques have been developed over the last decades to draw objects uniformly at random from an abstract specification. However, in the context of biological sequences and structures, the uniformity assumption fails and one has to consider non-uniform distributions in order to obtain relevant estimates. Typically, context-free grammars can handle certain kinds of long-range interactions such as base pairings in RNA secondary structures. Stochastic context-free grammars (SCFGs) have long been used to model both structural and statistical properties of genomic sequences, particularly for predicting the structure of sequences or for searching for motifs. They can also be used to generate random sequences. However, they do not allow the user to fix the length of these sequences. We have developed algorithms for the random generation of structures that respect a given probability distribution on their components. For this purpose, we first translate the (biological) structures into combinatorial classes, according to the framework developed by Flajolet et al. Our approach is based on the concept of weighted combinatorial classes, in combination with the so-called recursive method for generating combinatorial structures. Putting weights on the atoms makes it possible to bias the probabilities in order to get the desired distribution. The main issue is to develop efficient algorithms for finding suitable weights. An implementation is given in the GenRGenS software http://www.lri.fr/~genrgens/ .
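The weighted recursive method can be sketched on a toy class of secondary structures in dot-bracket notation, defined by the grammar S → ε | . S | ( S ) S. The sketch below is only an illustration under simplifying assumptions: a single hypothetical weight `pair_weight` is attached to each base pair, the weight is given rather than fitted to a target distribution, and no minimum hairpin-loop length is enforced. It does not reflect the GenRGenS interface.

```python
import random


def weighted_counts(n, pair_weight):
    """W[m] = total weight of structures of length m, where a structure
    with p base pairs has weight pair_weight ** p."""
    W = [0.0] * (n + 1)
    W[0] = 1.0
    for m in range(1, n + 1):
        total = W[m - 1]                      # first position unpaired
        for k in range(m - 1):                # k positions enclosed by the first pair
            total += pair_weight * W[k] * W[m - 2 - k]
        W[m] = total
    return W


def sample_structure(n, pair_weight, W, rng):
    """Draw a structure of length exactly n with probability weight / W[n]."""
    if n == 0:
        return ""
    r = rng.random() * W[n]
    r -= W[n - 1]
    if r < 0:
        return "." + sample_structure(n - 1, pair_weight, W, rng)
    for k in range(n - 1):
        r -= pair_weight * W[k] * W[n - 2 - k]
        if r < 0:
            inside = sample_structure(k, pair_weight, W, rng)
            rest = sample_structure(n - 2 - k, pair_weight, W, rng)
            return "(" + inside + ")" + rest
    # Fallback for floating-point rounding at the upper end of the range.
    return "." + sample_structure(n - 1, pair_weight, W, rng)


rng = random.Random(0)
W = weighted_counts(30, pair_weight=2.0)
structure = sample_structure(30, 2.0, W, rng)
```

With `pair_weight = 1.0` the draw is uniform over all structures of the given length; larger weights favour structures with more base pairs, while the length stays fixed, which is what plain SCFG sampling does not guarantee.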

Recently, a new paradigm has appeared in ab initio secondary structure prediction [38]: in place of classical optimization algorithms, the new approach relies on probabilistic algorithms, based on statistical sampling within the space of solutions. We have recently made significant and original progress in this area [3], [19], including combinatorial models for structures with pseudoknots. Our aim is to combine this paradigm with a fragment-based approach for decomposing structures, such as the cycle decomposition by F. Major's group [42].

In addition, our work on random generation is applied in a different field, namely software testing and model-checking, in collaboration with the Fortesse group at Lri [13], [29].

Knowledge extraction

Participants : Jérôme Azé, Jiuqiang Chen, Sarah Cohen-Boulakia, Christine Froidevaux.

Our main goal is to design semi-automatic methods for annotation. A possible approach is to focus on the way we could discover relevant motifs in order to establish more precise links between function and the sequence of motifs. For instance, a commonly accepted hypothesis is that function depends on the order of the motifs present in a genomic sequence. Likewise, we must be able to evaluate the quality of the annotation obtained. This requires an estimate of the reliability of the results, which may use the combinatorial tools described above. It includes a rigorous statement of the validity domain of the algorithms and knowledge of the provenance of the results. We are interested in provenance resulting from workflow management systems, which are important in scientific applications for managing large-scale experiments and can be useful for computing functional annotations. A given workflow may be executed many times, generating huge amounts of information about the data produced and consumed. Given the growing availability of this information, there is an increasing interest in mining it to understand the differences in the results produced by different executions.

Systems Biology

Participants : Patrick Amar, Mahsa Behzadi, Sarah Cohen-Boulakia, Christine Froidevaux, Loic Paulevé, Sabine Peres, Mireille Régnier, Jean-Marc Steyaert.

Systems Biology involves the systematic study of complex interactions in biological systems using an integrative approach. The goal is to find new emergent properties that may arise from the systemic view in order to understand the wide variety of processes that happen in a biological system. Systems Biology activity can be seen as a cycle composed of theory, computational modelling to propose a hypothesis about a biological process, experimental validation, and use of the experimental results to refine or invalidate the computational model (or even the whole theory).

We concentrate on the computational modelling step of the cycle by developing a computer simulation system, Hsim, that mimics the interactions of biomolecules in an environment modelling the membranes and compartments found in real cells. In collaboration with biologists from the Ammis lab. at Rouen, we have used Hsim to show the properties of grouping the enzymes of the phosphotransferase system and the glycolytic pathway into metabolons in E. coli. In another collaboration, with the SysDiag Lab (UMR CNRS 3145) at Montpellier, we participate in the CompuBioTic project. This is a Synthetic Biology project in the field of medical diagnosis: its goal is to design a small vesicle containing specific proteins and membrane receptors. These components are chosen so that their interactions can sense and report the presence in the environment of molecules involved in human pathologies. We used Hsim to help design this "biological computer" and to test it qualitatively and quantitatively before in vitro experiments.

We participate in a research project, eSignal, with Inra (Asam, collaboration with the Inra-Bios laboratory) that aims at providing unique tools to decipher and model the most proximal layer of biological systems: intracellular biochemical networks. More precisely, we are interested in GPCRs (G protein-coupled receptors), which trigger complex signalling networks that are involved in a wide array of physio-pathological processes. As such, GPCRs are targeted by almost half of the currently marketed drugs. As systems biology has developed experimental means to generate massive quantities of high-quality data, there is a need for computational methods to integrate these data into predictive dynamic models. The Amib group aims at building an innovative pipeline of computational methods covering all the tasks needed to go from the initial data to predictive dynamic models of intracellular signalling mechanisms.

A cooperation with an Inserm-Inra team based in Clermont-Ferrand addresses the behaviour of biological systems. A mathematical approach is currently being developed to study the stability of some sub-domains and the importance of initial conditions, which are to be inferred. This involves data analysis of experimental facts and a comparative analysis. Discrete approaches are relevant here, to cope with the combinatorial explosion of the dynamics to explore and to analyze reachability properties within large networks. Software is being developed to improve the scalability of parameter inference.
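To make the notion of a reachability property concrete, here is a minimal sketch on a toy discrete model. It assumes a Boolean network with asynchronous updates (one component changes per step), which is only one possible discrete formalism and is not stated in the text above; the function `reachable` and the two-gene network `toy_rules` are hypothetical. The search enumerates states explicitly, which is exactly what explodes combinatorially on large networks and what scalable methods aim to avoid.

```python
from collections import deque


def reachable(rules, start, target):
    """Breadth-first search over asynchronous updates of a Boolean network.

    rules maps each component name to a function of the current state (a dict)
    returning the value that component tends towards.
    """
    names = sorted(rules)
    start_t = tuple(start[v] for v in names)
    target_t = tuple(target[v] for v in names)
    seen = {start_t}
    queue = deque([start_t])
    while queue:
        state = queue.popleft()
        if state == target_t:
            return True
        env = dict(zip(names, state))
        for i, v in enumerate(names):
            new_val = bool(rules[v](env))
            if new_val != state[i]:
                # Asynchronous semantics: update a single component at a time.
                nxt = state[:i] + (new_val,) + state[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


# Hypothetical toy network: two components that inhibit each other.
toy_rules = {
    "a": lambda s: not s["b"],
    "b": lambda s: not s["a"],
}
both_off = {"a": False, "b": False}
can_reach_a_only = reachable(toy_rules, both_off, {"a": True, "b": False})
can_reach_both = reachable(toy_rules, both_off, {"a": True, "b": True})
```

In this toy network, the state where only `a` is on is reachable from the all-off state, while the state where both are on is not, which illustrates how the outcome depends on the initial conditions.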