Amib is a joint team with LIX, Ecole Polytechnique and LRI, Paris-Sud XI University. The team was created on May 1st, 2009 and is under evaluation.
This bioinformatics project is mainly concerned with the molecular levels of organization in the cell, dealing principally with RNAs and proteins; we currently concentrate our efforts on structure, interactions, evolution and annotation, and aim to contribute to protein and RNA engineering. On the one hand, we study and develop methodological approaches for dealing with macromolecular structures and annotation: the challenge is to develop abstract models that are computationally tractable and biologically relevant. On the other hand, we apply these computational approaches to several particular problems arising in fundamental molecular biology. The trade-off between the biological accuracy of the model and its computational tractability or efficiency is addressed in close partnership with experimental biology groups.
We investigate the relations between nucleotide sequences, 3D structures and, finally, biochemical function. All protein functions and many RNA functions are intimately related to the three-dimensional molecular structure. Therefore, we view structure prediction and sequence analysis as an integral part of gene annotation, which we study simultaneously and plan to pursue on an RNAomic and proteomic scale. Our starting point is the sequence, either ab initio or with some additional knowledge such as a 3D structural template or ChIP-Chip experiments. We are interested in deciphering information organization in DNA sequences and in identifying the role played by gene products: proteins and RNA, including noncoding RNA. We develop a common toolkit of computational methods that relies notably on combinatorial algorithms, mathematical analysis of algorithms and data mining. One goal is to provide software or platform elements that predict either structures or structural and functional annotation. For instance, a by-product of 3D structure prediction for protein and RNA engineering is the ability to propose sequences with admissible structures. Statistical software for structural annotation is included in annotation tools developed by partners, notably our associate team Migec.
Our work is organized along two main axes. The first one is structure prediction, comparison and design engineering. The relation between nucleotide sequence and 3D macromolecular structure, and the relation between 3D structure and biochemical function are possibly the two foremost problems in molecular biology. There are considerable experimental difficulties in determining 3D structures to a high precision. Therefore, there is a crucial need for efficient computational methods for structure prediction, functional assignment and molecular engineering. A focus is given on both protein and RNA structures.
The second axis is structural and functional annotation, a special attention being paid to regulation. Structural annotation deals with the identification of genomic elements, e.g. genes, coding regions, non coding regions, regulatory motifs. Functional annotation consists in characterizing their function, e.g. attaching biological information to these genomic elements. Namely, it provides biochemical function, biological function, regulation and interactions involved and expression conditions. High-throughput technologies make automated annotation crucial. There is a need for relevant computational annotation methods that take into account as many characteristics of gene products as possible -intrinsic properties, evolutionary changes or relationships- and that can estimate the reliability of their own results.
Most problems in computational biology are NP-hard as soon as all known and reasonable biological information is taken into account. For instance, structural biology is concerned with 3D structures of complex molecules. Prediction, comparison and design are, in fact, three optimisation problems in which these structures are classically represented by graphs, and they are known to be NP-complete. A fruitful strategy consists in designing models that maintain biological relevance while being simple enough to be computationally tractable. The chosen representation determines the data structures and classes of algorithms to be used. The challenge is to develop formal models, along with efficient algorithms or heuristics to deal with them. The various biological problems described above raise different computer science issues. To tackle them, the project members rely on a common methodology in which our group has significant experience. Indeed, many of these problems can be expressed with classical combinatorial objects such as graphs, trees, words and grammars.
Common activity with P. Clote (Boston College and Digiteo).
Recoding covers several non-conventional phenomena in the translation of messenger RNA (mRNA) into proteins, including frameshift, readthrough and hopping, where a single mRNA sequence allows the synthesis of (at least) two different polypeptides. Recoding is mandatory for the machinery and viability of many viruses. We develop two complementary computational methods that aim to find genes subject to recoding events in genomes. The first one is based on a model of the recoding site; the second one is based on a large-scale comparative genomics approach. In both cases, our predictions are subject to experimental biological validation by our collaborators at Igm (Institut de Génétique et Microbiologie), Paris-Sud University. This work is funded by the ANR (project RNA-RECOD, ANR BLANC 2006-2010). Additionally, we are currently developing a combinatorial approach, based on random generation, to design small and structured RNAs. Our goal is to build these RNAs such that their hybridization with existing mRNAs will be favorable to independent folding, and will therefore affect the stability of some secondary structures involved in recoding events. An application of this methodology to the Gag-Pol HIV-1 frameshifting site will be carried out with our collaborators at Igm. We hope that, by capturing the hybridization energy at the design stage, one will be able to gain control over the rate of frameshift and consequently fine-tune the expression of Gag/Pol.
It has also been observed, mainly in bacteria, that some mRNA sequences may adopt an alternate fold. Such an event is called a riboswitch. A common feature of recoding events and riboswitches is that some structural elements on the mRNA initiate an unusual action of the ribosome or allow for an alternate fold under some environmental conditions. One challenge is to predict genes that might be subject to riboswitches.
Another mid-term challenge is the design of molecules that enhance or repress such events.
Single-stranded RNA folds into a stable and compact structure. This folding leads to a secondary structure, which is an intermediate structure level for RNA, between the sequence alone and the full (tertiary) structure. It is based on pairing between complementary bases (A-U and C-G). A recent classification, the Leontis-Westhof classification, distinguishes twelve different kinds of chemical bonds between two nucleotides, according to the way they are linked together within the tertiary structure. Other kinds of interactions are also taken into account, such as stacking and phosphodiester bonds along the sequence. This knowledge turns out to be crucial for determining molecular stability. Moreover, recent work on RNA biochemistry has shown that RNA molecules are structured by RNA tertiary motifs. These motifs, which are known from 3D structures, can be seen as “small bricks” that play a very important role in RNA structuration. Indeed, it was shown that taking these motifs into account can significantly improve 3D prediction methods. We develop graph algorithms for extracting tertiary motifs from RNA structures, and for predicting the tertiary structure from the sequence. This project, in collaboration with two groups from University of Strasbourg and University of Versailles, is funded by the ANR (project AMIS-ARN, ANR BLANC 2009-2012).
The function of many proteins depends on their interaction with one or many partners. Despite the improvements due to structural genomics initiatives, the experimental solving of complex structures remains a difficult problem. The prediction of complexes, docking, proceeds in two steps: a configuration generation phase, or exploration, and an evaluation phase, or scoring. As the verification of a predicted conformation is time consuming and very expensive, it is a real challenge to reduce the time dedicated to the analysis of complexes by the biologists. In a collaboration with A. Poupon, Inra-Tours, a method that sorts the various potential conformations by decreasing probability of being real complexes has been developed. It relies on a ranking function that is learnt by an evolutionary algorithm. The learning data are given by a geometric modelling of each conformation obtained by the docking algorithm proposed by the biologists. Objective tests are needed for such predictive approaches. The Critical Assessment of Predicted Interaction, Capri, a community-wide experiment modelled after Casp, was set up in 2001 to achieve this goal (http://
The amino acid sequence of a protein determines its structure and biological function, but no concise and systematic set of rules has been stated so far to describe the functions associated with a sequence; experimental methods are time (and money) consuming. Massive genome sequencing has revealed the sequences of millions of proteins, whereas only roughly 55,000 3D protein structures are known to date. Structure prediction in silico attempts to fill this gap. It consists in finding a tentative spatial (3D) conformation that a given nucleotide or amino acid sequence is likely to adopt. A second problem of interest is inverse protein folding, or computational protein design (CPD), that is, the prediction of amino acid sequences that adopt a particular target tertiary structure. This problem has many implications, such as protein folding and stability, structure prediction (fold recognition), or protein evolution. Moreover, it is a mandatory step towards the design of new, artificial proteins. The engineering of protein-ligand interactions also has great biological and technological value. For example, the recent engineering of aminoacyl-tRNA synthetase (aaRS) enzymes has led to organisms with a modified genetic code, expanded to include non-natural amino acids.
Molecular dynamics (MD) simulations use numerical methods to study the motion of atoms, which is by far too complex for analytical studies. They were used by Bioc for extensive computational engineering of aaRS, aminoacyl-tRNA synthetases. For computational protein design, and for structure prediction as well, a possible modelling considers the protein backbone and sidechains. The backbone structure may be known from high-resolution methods. High-quality models for the interactions of sidechains with the solvent have been designed. There is a finite number of possible positions for sidechains, which may be stored in a rotamer library. A fitness, or energy, function that relies on atomistic and physical-chemical criteria is associated with each conformation. Therefore, one may search the set of possible sequences to optimize stability criteria.
Another novel ingredient is the use of negative design: the ability to select against sequences that have undesired properties, such as a tendency to fold into alternate, undesired structures. It can be critical for attaining specificity when competing states are close in (stability) structure space. There are also current efforts to enlarge this thermodynamical point of view by a new knowledge on natural proteins with known conformations.
Our goal is to predict the structure of different classes of barrel proteins. These proteins include the two large classes of transmembrane proteins, which carry out important functions. Nevertheless, their structure is still difficult to determine by standard experimental methods such as X-ray crystallography or NMR. Most existing methods only address single-domain protein structures. Therefore, for large proteins, a preprocessing step that determines the protein domains is necessary. Then, a suitable model of energy functions needs to be designed for each specific class. We have designed a pseudo-energy minimization method for the prediction of the super-secondary structure of β-barrel or α-helical-barrel proteins with structural knowledge-based enhancement. The method relies on graph-based modelling and also deals with various topological constraints such as Greek key or Jelly roll conformations.
We aim at enumerating or generating sequences or structures that are admissible, in the sense that they are likely to possess some given biological property. Team members have a common expertise in enumeration and random generation of combinatorial structures. They have developed computational tools for probability distributions on combinatorial objects, using in particular generating functions and analytic combinatorics. Admissibility criteria can be mainly statistical; they can also rely on the optimisation of some biological parameter, such as an energy function.
The ability to distinguish a significant event from statistical noise is a crucial need in bioinformatics. In a first step, one defines a suitable probabilistic model (null model) that takes into account the relevant biological properties on the structures of interest. A second step is to develop accurate criteria for assessing (or not) their exceptionality. An event observed in biological sequences, is considered as exceptional, and therefore biologically significant, if the probability that it occurs is very small in the null model. Our approach to compute such a probability consists in an enumeration of good structures or combinatorial objects. Thirdly, it is necessary to design and implement efficient algorithms to compute these formulae or to generate random data sets. Two typical examples that motivate research on words and motifs counting are Transcription Factor Binding Sites, TFBSs, and consensus models of recoding events. The project has a significant contribution in word enumeration area. When relevant motifs do not resort to regular languages, one may still take advantage of combinatorial properties to define functions whose study is amenable to our algebraic tools. One may cite secondary structures and recoding events.
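As an illustration of the null-model computation described above, the following sketch computes the exact probability that a single word occurs at least once in a random text of fixed length. It is a minimal example under stated assumptions, not the team's software: the null model is a Bernoulli (i.i.d. letters) model, the word-matching automaton is built naively, and the function names `matching_automaton` and `occurrence_probability` are hypothetical.

```python
from fractions import Fraction


def matching_automaton(word, alphabet):
    """Transition table delta[state][letter] of the automaton whose state is the
    length of the longest prefix of `word` that is a suffix of the text read so far."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for letter in alphabet:
            text = word[:state] + letter
            k = min(m, len(text))
            # Shrink k until the last k letters read match a prefix of the word.
            while k > 0 and text[-k:] != word[:k]:
                k -= 1
            delta[state][letter] = k
    return delta


def occurrence_probability(word, n, probs):
    """P(word occurs at least once in a text of length n), letters drawn
    independently with the probabilities given in the dict `probs`."""
    m = len(word)
    delta = matching_automaton(word, list(probs))
    # dist[s]: probability of being in state s without having seen the word yet.
    dist = [Fraction(0)] * m
    dist[0] = Fraction(1)
    for _ in range(n):
        new = [Fraction(0)] * m
        for state, mass in enumerate(dist):
            if mass == 0:
                continue
            for letter, p in probs.items():
                target = delta[state][letter]
                if target < m:  # mass reaching state m is absorbed (word found)
                    new[target] += mass * p
        dist = new
    return 1 - sum(dist)


uniform = {a: Fraction(1, 4) for a in "ACGT"}
print(occurrence_probability("ACGT", 4, uniform))  # 1/256
```

An observed word would then be called exceptional when this probability is very small in the null model; handling sets of words or Markov models requires the more elaborate constructions discussed in the text.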
A starting project considers an algorithm for the disambiguation of automata that uses the powerful techniques developed by Cyril Nicaud (Igm, Marne-la-Vallée University) to generate random automata. Another appealing problem is the random walk problem, considered as a model of ranked gene expression that could be used for medical diagnosis. In the mathematical setting, we want to know the probability that a random bridge of length $n$ with increments $X_i \in \{+d, -c\}$ exits a strip $-H \le y \le H$. The increments have expectation zero, and it is possible to assume that they are independent, later conditioning the walk to come back to zero at time $n$. If the increments $X_n$ are bounded, the limit of the walk as $n$ tends to infinity is a Brownian bridge, the statistics of which are well known; however, in practice, on the one hand the value of $d$ may be large, and on the other hand we are in the range of large deviations for small $p$-values. For these reasons, it is necessary to consider the discrete case. Banderier and Flajolet provided in 2002 an extensive account of discrete random walks, although they do not consider the heights of the walks. A collaboration has begun with Cyril Banderier (Lipn, University Paris-North) on the subject; Nicolas Broutin (Inria-Algorithms) and Thomas Feierl (joining Inria-Algorithms on Dec. 1st) should join this collaboration. The bioinformatics aspects will be considered by Marcel Shulz (Max-Planck Institut Berlin-Dahlem).
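For small parameters, the discrete bridge problem stated above can be solved exactly by dynamic programming over the positions of the walk. The sketch below is only an illustration under explicit assumptions: the step $+d$ is taken with probability $c/(c+d)$ and the step $-c$ with probability $d/(c+d)$ so that the increments have expectation zero, "exiting the strip" means reaching a height $y$ with $|y| > H$, and the function name `bridge_exit_probability` is hypothetical.

```python
from fractions import Fraction


def bridge_exit_probability(n, d, c, H):
    """Probability that a walk with i.i.d. steps +d / -c (zero mean), conditioned
    to be back at 0 at time n, leaves the strip -H <= y <= H at some time."""
    p_up = Fraction(c, c + d)   # assumption: chosen so that E[X_i] = 0
    p_down = 1 - p_up

    def return_mass(bound):
        """Mass of the paths of length n ending at 0; if `bound` is not None,
        only paths staying within [-bound, bound] are counted."""
        dist = {0: Fraction(1)}
        for _ in range(n):
            new = {}
            for y, mass in dist.items():
                for step, p in ((d, p_up), (-c, p_down)):
                    z = y + step
                    if bound is not None and abs(z) > bound:
                        continue  # this path has left the strip
                    new[z] = new.get(z, 0) + mass * p
            dist = new
        return dist.get(0, Fraction(0))

    total = return_mass(None)
    if total == 0:
        raise ValueError("no bridge of this length exists for these steps")
    return 1 - return_mass(H) / total


print(bridge_exit_probability(4, 1, 1, 1))  # 1/3
```

The cost grows with the number of reachable heights, so this direct computation is only practical for moderate $n$ and $d$; it mainly serves as a reference against which asymptotic or analytic formulas can be checked.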
Analytical methods fail when both sequential and structural constraints of sequences are to be modelled or, more generally, when molecular structures such as RNA structures have to be handled. For these more complex models, an experimental approach (i.e. a computational generation of random sequences) is still necessary. Typically, context-free grammars can handle certain kinds of long-range interactions such as base pairings in RNA secondary structures. Stochastic context-free grammars (SCFGs) have long been used to model both structural and statistical properties of genomic sequences, particularly for predicting the structure of sequences or for searching for motifs. They can also be used to generate random sequences. However, they do not allow the user to fix the length of these sequences. We developed algorithms for the random generation of structures that respect a given probability distribution on their components. For this purpose, we first translate the (biological) structures into combinatorial classes, according to the framework developed by Flajolet et al. Our approach is based on the concept of weighted combinatorial classes, in combination with the so-called recursive method for generating combinatorial structures. Putting weights on the atoms makes it possible to bias the probabilities in order to get the desired distribution. The main issue is to develop efficient algorithms for finding the suitable weights.
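The combination of weights and the recursive method can be sketched on a toy class. The example below assumes a deliberately simplified model of RNA secondary structures (dot-bracket words generated by S → ε | `.` S | `(` S `)` S, with no minimal loop size) and a single weight `w` attached to each base pair; the function names are hypothetical and this is not the team's implementation.

```python
import random


def weighted_counts(n, w):
    """W[m] = total weight of the structures of length m, each base pair
    contributing a multiplicative weight w."""
    W = [0.0] * (n + 1)
    W[0] = 1.0
    for m in range(1, n + 1):
        total = W[m - 1]                      # first position unpaired
        for k in range(m - 1):                # k positions enclosed by the first pair
            total += w * W[k] * W[m - 2 - k]
        W[m] = total
    return W


def sample(n, w, W, rng):
    """Draw a structure of length exactly n with probability proportional to its weight."""
    if n == 0:
        return ""
    r = rng.random() * W[n]
    r -= W[n - 1]
    if r < 0:
        return "." + sample(n - 1, w, W, rng)
    for k in range(n - 1):
        r -= w * W[k] * W[n - 2 - k]
        if r < 0:
            return "(" + sample(k, w, W, rng) + ")" + sample(n - 2 - k, w, W, rng)
    # Floating-point rounding fallback: return a valid (unpaired-first) structure.
    return "." + sample(n - 1, w, W, rng)


rng = random.Random(2009)
W = weighted_counts(30, 5.0)
print(sample(30, 5.0, W, rng))
```

With `w = 1` the distribution is uniform over all structures of length `n`; increasing `w` favours structures with more base pairs, which is the bias mechanism referred to in the text. The length is fixed by construction, in contrast with sampling from a stochastic grammar.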
Our main goal is to design semi-automatic methods for annotation. A possible approach is to focus on the way we could discover relevant motifs in order to make more precise links between function and motifs sequence. Indeed, a commonly accepted hypothesis is that function depends on the order of the motifs present in a genomic sequence. Examples of relevant motifs can be frameshift motifs, RNA structural motifs, TFBS or PFAM domains. General tools must then be developed in order to assess the significance of the motifs found out. Likewise we must be able to evaluate the quality of the annotation obtained. This necessitates giving an estimate of the reliability of the results that includes a rigorous statement of the validity domain of algorithms and knowledge of the results provenance. We are interested in provenance resulting from workflow management systems that are important in scientific applications for managing large-scale experiments and can be useful to calculate functional annotations. A given workflow may be executed many times, generating huge amounts of information about data produced and consumed. Given the growing availability of this information, there is an increasing interest in mining it to understand the difference in results produced by different executions.
Varna is a new tool for the automated drawing, visualization and annotation of the secondary structure of RNA, designed as companion software for web servers and databases. Varna implements four drawing algorithms, supports input/output using the classic formats dbn, ct, bpseq and RNAML, and exports the drawing in five picture formats, either pixel-based (JPEG, PNG) or vector-based (SVG, EPS and XFIG). It also allows manual modification and structural annotation of the resulting drawing, using either an interactive point-and-click approach, within a web server, or through command-line arguments.
As of November 2009, Varna is used by RNA scientists and websites such as the NestedAlign web server (http://
This software, freely available at http://
In a work published in 2004, Condon analyzed five recent algorithms that predict secondary structures with pseudoknots. Relying on rewriting rules, she characterized the classes of pseudoknots that may be predicted. A collaborative work between Lix and Lri provides an alternative combinatorial characterization by graphs, from which enumeration follows, and additionally studies a new class. In the long term, one expects to add biological constraints to these combinatorial definitions.
Canonical secondary structures of RNA are those without lonely base pairs. Secondary structure prediction algorithms such as RNAfold are claimed to be more accurate when folding structures without lonely base pairs than with isolated pairs. B. Raman and P. Clote (Relative Accuracy of RNAfold to Rfam Consensus for Canonical Secondary Structures) validate this claim: RNAfold improves accuracy in the prediction of canonical structures. This is assessed by extensive experiments using RNA sequences obtained from the RNA database Rfam. The accuracy of the RNAfold algorithm is evaluated with respect to the consensus secondary structure of each RNA family in the Rfam database. This paper also points out that, for certain families in the Rfam database, the consensus secondary structure is inaccurate.
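The definition of a canonical structure can be checked mechanically. The sketch below assumes that the structure is given in dot-bracket notation without pseudoknots and that a base pair (i, j) is lonely when neither (i-1, j+1) nor (i+1, j-1) is also a pair; the function names are hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the sorted list of base pairs (i, j) of a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def lonely_pairs(dot_bracket):
    """Base pairs that are not stacked on any neighbouring pair."""
    pairs = set(base_pairs(dot_bracket))
    return sorted(
        (i, j)
        for (i, j) in pairs
        if (i - 1, j + 1) not in pairs and (i + 1, j - 1) not in pairs
    )


def is_canonical(dot_bracket):
    """A structure is canonical when it has no lonely base pair."""
    return not lonely_pairs(dot_bracket)


print(is_canonical("((...))"))         # True
print(lonely_pairs("((..)).(...)"))    # [(7, 11)]
```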
To predict the structure of a riboswitch, the first step is to extract the complete RNA sequence from the genome sequence, that is, both the aptamer and the expression platform of the riboswitch. To predict the structure after a target molecule binds to the aptamer of the riboswitch, it is also necessary to know the sequence, and in turn the structure, of the expression platform: only then can we identify the subsequences of the RNA involved in an alternate, stable riboswitch structure. The second step is to predict the secondary structure of the extracted RNA sequence such that the elements of the expected riboswitch family appear in the folded secondary structure. For example, in the aptamer portion of a TPP riboswitch there is a thi-box element, whose structure, and a significant portion of the sequence as well, is conserved in Prokaryotes and in some Eukaryotes. To achieve this, it is desirable to have a database containing the correct secondary structures of known riboswitches. The Rfam database has a collection of riboswitch sequences with the consensus structure, but the sequences correspond to just the aptamer portion. We developed a computational pipeline for generating accurate secondary structures for all TPP riboswitch entries in the Rfam database. In this work, we use the software tools of the pipeline to: (a) retrieve sequences from genome banks corresponding to TPP riboswitch entries in Rfam, (b) locate the aptamer portion in the retrieved sequence, and (c) fold sequences to predict secondary structures that are accurate compared to the conserved structure in known TPP riboswitches.
A protein-protein docking procedure traditionally consists of two successive tasks: a search algorithm generates a large number of candidate solutions, and then a scoring function is used to rank them in order to extract a native-like conformation. We have already demonstrated that, using Voronoi constructions and a defined set of parameters, we could optimize an accurate scoring function. However, the precision of such a function is still not sufficient for large-scale exploration of the interactome. This year we tried another construction: the Laguerre tessellation. It also allows fast computation without losing the intrinsic properties of the biological objects. Being related to the Voronoi construction, it was expected to better represent the physico-chemical properties of the partners. We have compared both constructions. In recent years, we also worked on introducing a hierarchical structure, obtained by clustering, over the original complex three-dimensional structures used for learning. Using this clustering model, we can optimize the scoring functions and get more accurate solutions. This scoring function has been tested on Capri scoring ensembles, and at least one acceptable conformation is found in the top 10 ranked solutions in all cases. This work has been submitted for publication. It is part of the thesis of Thomas Bourquard.
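The difference between the two constructions can be illustrated by the rule that assigns a point of space to a cell. In a Voronoi diagram a point belongs to the nearest site; in a Laguerre tessellation (also called a power diagram) each site carries a radius and the point belongs to the site with the smallest power distance, that is, squared distance minus squared radius. The sketch below only shows this assignment rule, assuming sites are given as (center, radius) pairs; it is not the scoring method of the text, and the function names are hypothetical.

```python
def power_distance(point, center, radius):
    """Power distance of `point` to the sphere (center, radius)."""
    squared = sum((p - c) ** 2 for p, c in zip(point, center))
    return squared - radius ** 2


def laguerre_cell(point, sites):
    """Index of the site whose Laguerre cell contains `point`."""
    return min(
        range(len(sites)),
        key=lambda i: power_distance(point, sites[i][0], sites[i][1]),
    )


def voronoi_cell(point, sites):
    """Index of the nearest site: the special case where radii are ignored."""
    return min(
        range(len(sites)),
        key=lambda i: power_distance(point, sites[i][0], 0.0),
    )


# Two hypothetical sites: a small sphere at the origin and a larger one nearby.
sites = [((0.0, 0.0, 0.0), 1.0), ((4.0, 0.0, 0.0), 3.0)]
query = (1.5, 0.0, 0.0)
print(voronoi_cell(query, sites), laguerre_cell(query, sites))  # 0 1
```

The query point is closer to the first center, but the larger radius of the second site extends its Laguerre cell, which is how sizes of the modelled objects can enter the tessellation.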
A. Sedano studied the inverse folding problem of proteins during her internship, supervised by T. Simonson and J.-M. Steyaert. The classic fold recognition problem consists in predicting the three-dimensional structure of a protein from its amino acid sequence, using homology modelling. A complementary approach consists in inverting this problem, which raises the inverse folding problem: identify the most favorable sequences corresponding to a given 3D structure, or fold. The main question is to map the millions of protein sequences extracted from genomes onto the tens of thousands of known 3D structures. She applied probabilistic analysis methods, such as those of Ranganathan, Thirumalai or Nussinov, to large sets of sequences of the PDZ domain family (computed sequences at first, then natural ones). These methods determine the correlations between distant mutations in a structure. Later, these correlations should make it possible to describe, in terms of sequence, the signature of a given structure. She also tested these methods by working not on mutations between amino acids but on mutations between classes of amino acids, to facilitate comparisons between sites along the sequence.
Our algorithm first predicts a super-secondary structure by dynamic programming. The running time of this step, expressed as a function of the number $n$ of amino acids, is bounded separately for the common up-down topology and for the Greek key motifs. Finally, a predicted three-dimensional structure is built from the geometric criteria. The method has been tested on transmembrane β-barrel proteins and reaches an efficiency comparable to previous approaches. It can be further improved by refining the energetic model, especially on turns and loops. The structural model may also be refined, since additional structural constraints may simplify the problem. The prediction accuracy for the class of known β-barrel transmembrane proteins, evaluated as the percentage of well-labelled residues, reaches 70-85%. The number of strands is correctly predicted, whereas the shear number, the second main geometric characteristic of a β-barrel, is relatively suitable. The method is being used to carry out screening experiments on proteomic databases, e.g. the Paramecium bank, in a collaboration with Ph. Dessen (Institut Gustave Roussy).
Cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors, or clusters. Formally, such sites can be viewed as words co-occurring in the DNA sequence. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. This is partially solved by our previous algorithm, AhoPro. OvGraph, developed with our associate team Migec, is intended to solve memory problems. We introduced a new concept of overlap graphs to count word occurrences and their probabilities. The concept led to a recursive equation that differs from the classical one based on the finite automaton accepting the proper set of texts. In the case of many occurrences, our approach yields the same order of time and space complexity as the approach based on the minimized automaton. The OvGraph algorithm relies on traversals of a graph whose set of vertices is associated with the overlaps of the words of the given set. Edges define two oriented subgraphs that can be interpreted as equivalence relations on the words of this set. Let P be the set of equivalence classes and S be the set of other vertices; the run time for the Bernoulli model is expressed in terms of these sets. In a Markov model of order K, the additional space complexity is $O(pm|V|^K)$ and the additional time complexity is $O(npm|V|^K)$. Our preprocessing uses a variant of the Aho-Corasick automaton. Our algorithm is implemented for the Bernoulli model and provides a significant space improvement in practice.
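The notion of overlap between words of a set can be made concrete with a few lines of code. The sketch below only enumerates overlaps, taken here to be the non-empty strings that are both a proper suffix of one word and a proper prefix of a word of the set (possibly the same word). This definition and the function name `overlaps` are assumptions made for illustration; the sketch is not the OvGraph algorithm and does not build its graph or its equivalence classes.

```python
def overlaps(words):
    """Set of non-empty strings that are a proper suffix of some word
    and a proper prefix of some (possibly identical) word of the set."""
    found = set()
    for u in words:
        for v in words:
            # k ranges over proper lengths for both u and v.
            for k in range(1, min(len(u), len(v))):
                if u[-k:] == v[:k]:
                    found.add(v[:k])
    return found


print(sorted(overlaps({"ACA", "CAC"})))  # ['A', 'AC', 'C', 'CA']
```

Overlapping occurrences are exactly what makes counting difficult: in the text ACACA, the word ACA occurs twice, and the two occurrences share the overlap A.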
A new problem addressed by MPV, developed with J. Bourdon (LINA-Nantes and Inria-Symbiose) and Migec, is the significance assessment for clusters of motifs. The classical method to study a set of motifs (defined, for instance, by their Position Weight Matrices, PWM) computes a significance score for each motif in the sequence set under study and then chooses (arbitrarily) a threshold to select the most significant motifs (top 10 motifs, motifs with a p-value smaller than 5%, ...). Such a choice makes it very difficult to keep under control the number of false positives induced by this selection. We have developed a method, relying on generating functions, that computes a significance criterion for the selection. Therefore, it provides the number of false positives. Such information is beyond the scope of other methods that correct the p-values for multiple tests: Bonferroni, Benjamini-Hochberg, ... A prototype is available online:
http://
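For reference, the two classical multiple-testing corrections named above can be written in a few lines. This sketch shows the standard Bonferroni and Benjamini-Hochberg procedures only, as a baseline for comparison; it is not the generating-function method of MPV, and the p-values in the example are made up for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject the hypotheses whose p-value is at most alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value lies under the line alpha * rank / m.
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


motif_pvalues = [0.001, 0.02, 0.03, 0.5]   # hypothetical per-motif p-values
print(bonferroni(motif_pvalues))           # [True, False, False, False]
print(benjamini_hochberg(motif_pvalues))   # [True, True, True, False]
```

Both procedures return a set of selected motifs for a chosen level, but neither returns an estimate of how many of the selected motifs are false positives, which is the information the text says the proposed criterion provides.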
Some related theoretical aspects have been considered by P. Nicodème. The non-reduced case of word statistics is considered, where words of the searched motif may be factors of other words of the motif. This is a joint work with Frédérique Bassino (Lipn, University Paris-North) and Julien Clément (Greyc, University of Caen); an article on this matter has been submitted to the journal Transactions on Algorithms. Since DNA is a text sequence, the analysis of suffix trees is of obvious importance. This analysis is often coupled with the analysis of tries. A joint work of P. Nicodème with Gahyun Park (University of Wisconsin), Hsien-Kuei Hwang (Academia Sinica, Taiwan) and Wojciech Szpankowski (Purdue University) about Profiles of Tries has been published in the SIAM Journal on Computing.
The random generation of combinatorial objects is an alternative, yet natural, framework to assess the significance of observed phenomena. General and efficient techniques have been developed over the last decades to draw objects uniformly at random from an abstract specification. However, in the context of biological sequences and structures, the uniformity assumption fails and one has to consider non-uniform distributions in order to obtain relevant estimates. To that purpose we introduced a weighted random generation, which we previously implemented within the GenRGenS software
http://
In this collaboration between two of the team members and M. Termier (Igm, University Paris-Sud XI), we introduced and studied a generalization of the weighted models to general decomposable classes, defined using different types of atoms. We addressed the random generation of such structures with respect to a size $n$ and a targeted distribution in $k$ of its distinguished atoms $Z_1, \ldots, Z_k$. We consider two variations on this problem.

In the first alternative, the targeted distribution is given by $k$ real numbers $\mu_1, \ldots, \mu_k$ such that $0 < \mu_i < 1$ for all $i$ and $\mu_1 + \cdots + \mu_k \le 1$. We aim to generate random structures among the whole set of structures of a given size $n$, in such a way that the expected frequency of any distinguished atom $Z_i$ equals $\mu_i$. We address this problem by weighting the atoms with a $k$-tuple $\pi$ of real-valued weights, inducing a weighted distribution over the set of structures of size $n$. We first adapt the classical recursive random generation scheme into an algorithm taking $O(n^{1+o(1)} + mn\log n)$ arithmetic operations to draw $m$ structures from the $\pi$-weighted distribution. Secondly, we address the analytical computation of weights such that the targeted frequencies are achieved asymptotically, i.e. for large values of $n$. We derive systems of functional equations whose resolution gives an explicit relationship between the weights and the frequencies. Lastly, we give an algorithm in $O(kn^4)$ for the inverse problem, i.e. computing the frequencies associated with a given $k$-tuple $\pi$ of weights, and an optimized version in $O(kn^2)$ in the case of context-free languages. This allows for a heuristic resolution of the weights/frequencies relationship suitable for complex specifications.

In the second alternative, the targeted distribution is given by $k$ natural numbers $n_1, \ldots, n_k$ such that $n_1 + \cdots + n_k + r = n$, where $r \ge 0$ is the number of undistinguished atoms. The structures must be generated uniformly among the set of structures of size $n$ that contain exactly $n_i$ atoms $Z_i$ ($1 \le i \le k$). We give an algorithm for generating $m$ structures, whose complexity simplifies for regular specifications.
These results provide new foundations and tools for tackling structural bioinformatics problems, such as RNA design. They are described in a manuscript submitted to Theoretical Computer Science.
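As an illustration of the first alternative, the following Python sketch applies the weighted recursive scheme to a deliberately small class that is not taken from the manuscript: dot-bracket words generated by S -> empty | . S | ( S ) S, with a single distinguished atom '.' carrying the weight pi. The grammar, the function names and the bisection bounds are assumptions made for this example. The sketch runs in quadratic time and does not attempt to reach the complexity bounds quoted above.

```python
import math
import random


def weighted_tables(n, pi):
    """Return (W, D) for sizes 0..n.

    W[s] is the total weight of the words of size s (each '.' weighs pi);
    D[s] is the sum, over those words, of weight times number of '.' atoms.
    """
    W = [0.0] * (n + 1)
    D = [0.0] * (n + 1)
    W[0] = 1.0  # the empty word
    for s in range(1, n + 1):
        w = pi * W[s - 1]               # rule S -> . S
        d = pi * (W[s - 1] + D[s - 1])  # one extra '.' in each such word
        for i in range(s - 1):          # rule S -> ( S ) S, inner part of size i
            j = s - 2 - i
            w += W[i] * W[j]
            d += D[i] * W[j] + W[i] * D[j]
        W[s], D[s] = w, d
    return W, D


def expected_dot_frequency(n, pi):
    """Inverse problem: expected frequency of '.' at size n for a given weight."""
    W, D = weighted_tables(n, pi)
    return D[n] / (n * W[n])


def calibrate_weight(n, target, lo=1e-3, hi=1e3, steps=60):
    """Heuristic search (geometric bisection) for a weight reaching `target`.

    Relies on the expected frequency being non-decreasing in pi; the search
    interval [lo, hi] is an arbitrary choice for this sketch.
    """
    for _ in range(steps):
        mid = math.sqrt(lo * hi)
        if expected_dot_frequency(n, mid) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def sample(n, pi, W, rng):
    """Draw one word of size n with probability proportional to its weight."""
    if n == 0:
        return ""
    r = rng.random() * W[n]
    r -= pi * W[n - 1]
    if r >= 0:
        for i in range(n - 1):
            r -= W[i] * W[n - 2 - i]
            if r < 0:
                inner = sample(i, pi, W, rng)
                outer = sample(n - 2 - i, pi, W, rng)
                return "(" + inner + ")" + outer
    # Rule S -> . S (also used as a fallback against floating-point rounding).
    return "." + sample(n - 1, pi, W, rng)


if __name__ == "__main__":
    n, target = 30, 0.5
    pi = calibrate_weight(n, target)
    W, _ = weighted_tables(n, pi)
    print(pi, sample(n, pi, W, random.Random(0)))
```

The tables are computed once and reused for every draw, which mirrors the split between a precomputation cost and a per-structure cost in the bound quoted above. Floating-point weights are used for brevity and limit the sketch to moderate sizes.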
Recent work by Forslund and Sonnhammer has investigated to what extent the hypothesis that protein function follows largely from domain architecture holds true. They have shown that domain functional interplay may not follow directly from the properties of the domains in isolation, and suggested that it could be interesting to take into account the conservation of the sequential order of the domains. To achieve this, we have proposed a new method, called Snk (Sequential Nuggets of Knowledge).
http://
Identifying correspondences between concepts of two ontologies has become a crucial task for genome annotation. We have proposed O'Browser, a semi-automatic method to solve this problem in the case of two functional hierarchies. O'Browser is based on a classical ontology mapping architecture, but relies heavily on expertise in the underlying domain. First, experts are asked to validate obvious correspondences discovered by O'Browser and to identify functional groups of concepts in the ontologies. Then, they are asked to validate the correspondences obtained by combining the results of the automatic steps of our system. These steps consist of matchers designed to fit the characteristics of the ontologies. In particular, we have introduced a new instance-based matcher which uses homology relationships between proteins. We also proposed an original notion of adaptive weighting for combining the different matchers. O'Browser has been used to map concepts of Subtilist to concepts of FunCat, two functional hierarchies.
One of the most popular ways to access public biological data is through portals such as Entrez NCBI. Data entries are inspected in turn and cross-references between entries are followed. However, this navigational process is so time-consuming and difficult to reproduce that it does not allow scientists to explore all the alternative paths available (even though these paths may provide new information). BioBrowsing is a tool providing scientists with the data obtained when all the possible paths between NCBI sources have been followed (source path generation is done by BioGuide). Querying is done on the fly (no warehousing). BioBrowsing has a module that automatically updates the schema used by its query engine to take into account the new sources and links that appear in Entrez. Finally, profiles can be defined as a way of focusing the results on the user's specific interests.
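To make the idea of following all the possible paths between sources concrete, here is a minimal Python sketch that enumerates every cycle-free path between two sources in a small directed graph of cross-references. It is not the BioGuide algorithm, whose details are not given here; the function name `simple_paths` and the toy graph `TOY_LINKS` with its source names are hypothetical and do not describe actual Entrez links.

```python
def simple_paths(links, start, goal):
    """Yield every cycle-free path from `start` to `goal`.

    `links` maps a source name to the list of sources it cross-references.
    """
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        # Reverse-sorted push so that paths come out in lexicographic order.
        for nxt in sorted(links.get(node, ()), reverse=True):
            if nxt not in path:  # never visit a source twice on one path
                stack.append((nxt, path + [nxt]))


# Hypothetical cross-reference graph, for illustration only.
TOY_LINKS = {
    "SourceA": ["SourceB", "SourceC"],
    "SourceB": ["SourceC", "SourceD"],
    "SourceC": ["SourceD"],
}

if __name__ == "__main__":
    for path in simple_paths(TOY_LINKS, "SourceA", "SourceD"):
        print(" -> ".join(path))
```

The number of such paths can grow exponentially with the number of sources, which is one reason why exploring them by hand is impractical.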
In this context, we have studied the problem of differencing two workflow runs with the same specification. Our contributions are three-fold: (i) while in general this problem is NP-hard, we have proposed to consider a natural restriction of graph structures (series-parallel graphs overlaid with well-nested forking and looping) that is general enough to capture workflows encountered in practice; (ii) for this model of workflows, we have presented efficient, polynomial-time algorithms for differencing workflow runs; (iii) we have developed a prototype and conducted experiments demonstrating the scalability of our approach.
RNA-RECOD, ANR BLANC 2006-2010: Influence of mRNA structures on ribosome accuracy. Normal decoding can be diverted by sequences and structures on the mRNA, leading to recoding. Analysing these variations is a powerful tool for understanding the normal course of action of the translational machinery. The four teams involved in the project develop complementary approaches that have previously allowed the identification of several elements involved in recoding. Very recently, using a cryo-electron microscopy approach, we deciphered for the first time the precise role of the pseudoknot in a -1 frameshifting event. The project brings together several complementary approaches including biochemistry, genetics, molecular and structural biology and bioinformatics. The goal of the study is to i) compare the molecular mechanisms involved in several recoding events (-1 and +1 frameshifting, pyrrolysine incorporation), focusing on the associated structural modifications, and ii) identify new recoding sites in genomes.
AMIS-ARN, ANR BLANC 2009-2012: Graph Algorithms and Automatic Softwares for Interactive RNA Structure Modelling. We aim to make substantial progress on the problem of automatically or semi-automatically modelling the three-dimensional structure of RNA molecules, given their sequence. By semi-automatically we mean developing algorithms and software that can automatically propose (good) solutions, and that can efficiently compute alternative solutions according to new constraints or new hypotheses given by the expert modeler. More precisely, we plan to work on the following three points: 1. Development of computational methods for solving some key steps necessary for modelling RNA 3D structures. These methods will rely on new graph algorithms for molecular structures and on biological expertise on sequence-structure relations in RNA molecules. 2. Implementation of these methods in a software suite, Paradise, which is being developed by one of the partners (E. Westhof's lab, Strasbourg University) and which will be made freely available to the scientific community. 3. Application of these methods in order to model several molecules of interest.
Lri and Inra-Mig are partners in a one-year regional project, Afon: Annotation FONctionnelle (Functional Annotation). The aim of the project is to design semi-automatic methods to help scientists in the task of functional annotation of prokaryotic genomes.
P. Clote (Boston College) has started a new activity, supported by a Digiteo chair, on RNA properties. It is concerned in particular with folding energy distributions and the identification of riboswitches.
Migec, Mathématiques et Informatique en GEnomique Comparative (Mathematics and Computer Science in Comparative Genomics), is an associate team with NII-Genetika (Moscow, Russia). The goal of this cooperation is the development of analytic and statistical criteria to extract and analyze complex motifs in sequences, and the use of these criteria on entire genome sequences. This includes the development of methods for identifying complex motifs and combined motifs in genomes, analytic and numerical approaches to assess the statistical significance of candidates, and an experimental verification of putative motifs. Our main application is the analysis of regulatory regions in eukaryotic organisms such as human, mouse and insects. Special attention is paid to promoter sequences and to CpG islands in genes that control tissue differentiation and tumorigenesis. In this project, Amib members bring their skills and tools in pattern matching algorithms and (probabilistic) combinatorial enumeration. These results are complementary to the genome analysis technology developed at NII-Genetika, which includes the organisation of genomic databases, the creation of databases for functionally important regions, and the integration of data from different sources in biology and bioinformatics. This associate team is part of a long history of collaboration between Moscow and Inria groups, which also includes biologists from Berkeley.
Professor D. Frishman and S. Neuman (Mips, Munich) visited Amib for two days and one week, respectively. Professor V. Makeev (NII-Genetika and Migec) visited for one week, and E. Furletova (Migec PhD student) made three two-week visits. Professor M. Ward (Purdue University) visited for one week and A. Sim (Stanford University, Associate team Gnapi) for two weeks. Professor R. Backofen (Heidelberg) visited for two days. Professor N. Leontis (Bowling Green State University) visited for one week.
The whole team is involved in GDR-BIM (Biology, Computer Science and Mathematics). A. Denise has been the head of this GDR since 2006, Ch. Froidevaux was in charge of the subdomain Knowledge Representation, Ontologies, Data Integration and Grids, and J. Azé is the webmaster.
The Programme PluriFormation (PPF) Bioinformatics and Biomathematics, headed by Ch. Froidevaux, gathers teams of computer scientists, mathematicians, and biologists from the University of Paris-Sud XI interested in bioinformatics and biostatistics. The whole team is involved and participated in the final workshop in Tours (September 14th-15th).
Our seminar is held three times a month. This fall, we hosted seminars by B. Behzahdi (Google Research), A. Sim (Stanford University), S. Neuman (Mips, Munich), M. Ward (Purdue University), N. Leontis (Bowling Green State University) and F. Leclerc (Nancy University).
P. Amar was invited to give the talk "Modelling self assembly and behaviour of molecular complexes" at the Workshop on "MAS in Biology at the meso or macroscopic scales" in Paris on June 23rd.
J. Bernauer was invited to give a talk on "Computational Structural Biology: Periodic Triangulations for Molecular Dynamics" at the Workshop "Subdivide and tile: Triangulating spaces for
understanding the world", organized in Leiden (Netherlands), 16-20 November, 2009. See
http://
Y. Ponty was invited to give the talk "RNA as a combinatorial object: Asymptotics of RNA Shapes" at the bioinformatics seminar (hosted by R. Backofen) of the Technical University of Freiburg on November 27th.
Thuong Van Du Tran attended Mccmb'09 (Moscow, Russia) and Ismb/Eccb 2009 (Stockholm, Sweden) and presented posters.
P. Amar was a program committee member and scientific committee member of the conference Modelling Complex Biological Systems in the context of genomics.
J. Bernauer is chair of the Multi-resolution Modeling of Biological Macromolecules session at the Pacific Symposium on Biocomputing 2010.
S. Cohen-Boulakia was a program committee member of the international conferences and workshops Ssdbm 2009, Dils 2009, Swpm-2009 (First Int. Workshop on the role of Semantic Web in Provenance Management, co-located with Iswc-2009) and Icde 2010 (general track and demo track), and of the national conferences Bda 2009 and Jobim 2010.
Ch. Froidevaux was a program committee member of Edbt 2010, Ib 2010, Dils 2009, IEEE Cbms 2009 (Computer-Based Medical Systems, special track on Computational Proteomics) and the Third Int. Workshop on "Biomedical and Bioinformatics' Challenges to Computer Science" co-located with ICCS (2009 and 2010), and of the national conferences Egc 2009, Egc 2010 and Jobim 2009.
Ch. Froidevaux and S. Cohen-Boulakia organized the workshop Metadata, Ontologies and Quality of Annotation, Moqa (September 27th).
M. Régnier is a program committee member of the Recomb workshop on Regulatory Genomics and co-organized Mccmb'09 in Moscow.
A. Denise serves on the National Committee of Scientific Research: section 7, Sciences et Technologies de l'Information (Computer Science, Control, Signal and Communication), and interdisciplinary commission 43 (Modélisation de systèmes biologiques, bioinformatique).
Ch. Froidevaux has been the head of the Computer Science Department at University Paris-Sud XI (UFR des Sciences d'Orsay) since January 15th. She participated in the AERES committee that evaluated the Inria Lille Nord-Europe CRI.
M. Régnier serves in the Committee of French ANR
http://
The Master of Bioinformatics and Biostatistics of University Paris-Sud (
http://
M. Régnier has been invited by Al Farabi University (Almaty, Kazakhstan) to deliver a 20-hour master's course in bioinformatics. She serves on the committee of the French Agrégation of Mathematics (Computer Science option).
J. Bernauer teaches at AgroParisTech, Paris, MAP3 (3h), and at the University of Nice - Sophia-Antipolis, Master of Science in Computational Biology, course Algorithmic Problems in Computational Structural Biology (9h).
C. Saule is a teaching assistant at Orsay UFR (Internet programming, Software engineering, Databases and Java). Philippe Rinaudo is a teaching assistant for Programming principles and languages (Master 2 CCI) and Algorithmics and complexity in biology (Master 1 BIBS). Van Du Tran teaches a course on Algorithm and Complexity and a course on Java in L3 at Orsay.