The Symbiose project was created in 2002. Its general purpose is bioinformatics, that is, the modelling and analysis of large-scale genomic and post-genomic data. Our goal is to assist molecular biologists in formulating and discovering new biological knowledge from the information gained through public data banks and experimental data. The project is thus clearly application-oriented and combines multiple research fields in computer science towards this goal.
Bioinformatics has quite a broad meaning, and we first delimit the definition we use in our framework: it designates research at the interface between computer science and molecular biology, not all the "standard" informatics that is necessary to manage biological data on a daily basis. However, it is hard to carry out in-depth research in this domain without "biocomputing", that is, taking part in services of the second kind alongside biologists. In order to favour cooperative studies with biology labs, we have decided to create a Bioinformatics Center, with a research team, Symbiose, backed by a bioinformatics platform, GenOuest (or the converse...). This report mainly focuses on the research project. The activity of the bioinformatics platform being essentially one of development and transfer, it is described in a later section.
Our research specificities include our interest in large-scale studies (genomes, proteomes or regulation networks) and in the discrete methods necessary to handle the associated complexity. Our methods rely on discrete optimization, the analysis of systems of qualitative equations and formal language modelling. Our goal is to push forward the range of applicability of these methods by exploring the impact of specialized machines or algorithms.
The scientific axes on which the project focuses derive from our choice of modelling complex biological systems in a discrete framework while managing efficiency issues. More precisely, the project links together three main directions of research:
This track concerns the search for relevant (e.g. functional) spatial or logical structures in macromolecules, either with the intent to model specific spatial structures (secondary and tertiary structures, disulfide bonds...) or general biological mechanisms (transposition...). In the framework of language theory and combinatorial optimization, we try to answer four types of problems: the design of grammatical models of biological sequences; efficient filtering and model matching in data banks; protein structure prediction; and machine learning of grammatical models from sequences. The corresponding disciplinary fields are algorithmics on words, machine learning, data analysis and combinatorial optimization.
The ultimate goal, for the biologist, is to explain how the combination of genetic and metabolic interactions determines the phenotype observed at the molecular level, particularly in the case of diseases. The scarcity of quantitative data on biological phenomena implies the use of qualitative models. Our approach is based on the definition of graph models of biological networks and the derivation of discrete or differential models for explaining and predicting (in a broad sense) the behavior of the biological system. Special attention is paid to the diagnosis of large-scale models described by their interaction graph.
We investigate the practical use of parallelism to speed up computations in genomics. Topics of interest range from intensive sequence comparison to pattern or model matching, including structure prediction. We work on the codesign of algorithms and hardware architectures tailored to such applications. This work is based on the study of reconfigurable machines employing Field-Programmable Gate Arrays (FPGAs) or fast components such as flash memories or Graphics Processing Units.
Three PhD theses are being completed this year.
S. Tempel has worked on transposons, mobile genetic elements that are the main source of variation in many genomes. The thesis proposes a general method to split a family of related DNA sequences into repeated domains. It led to a new nomenclature of helitrons, the most widespread transposable element in A. thaliana, which has been published in Gene. S. Tempel currently holds a postdoctoral position in Mountain View, CA, in the laboratory of J. Jurka, who is responsible for Repbase, one of the main international banks of repeated genomic elements.
P. Veber has worked on the modelling of large-scale regulation networks. He particularly addressed the issue of diagnosing incompatibilities between expression data and a signed graph of cellular regulations. A solution based on Answer Set Programming has been developed, which allowed considering up to several thousand genes and regulations on E. coli and yeast data. This study demonstrates the great usefulness of this framework for providing a consistency check, proposing corrective actions and deriving new predictions. It has been accepted for publication in BMC Bioinformatics. P. Veber currently holds a postdoctoral position in Potsdam, Germany, in the laboratory of T. Schaub, a leading team in Answer Set Programming.
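The consistency check at the heart of this approach can be illustrated with a toy sketch (plain Python, not the Answer Set Programming implementation of the thesis): under a common qualitative rule, every observed variation of a regulated gene must be explainable by at least one of its regulators, through the sign of the corresponding edge.

```python
# Toy qualitative consistency check on a signed regulation graph.
# A target's observed shift is "explained" if some regulator's shift,
# multiplied by the edge sign, matches it.

def inconsistent_nodes(edges, shifts):
    """edges: list of (regulator, target, sign) with sign in {+1, -1};
    shifts: dict node -> observed variation in {+1, -1}.
    Returns the targets whose variation no incoming edge can explain."""
    preds = {}
    for u, v, s in edges:
        preds.setdefault(v, []).append((u, s))
    bad = []
    for v, influences in preds.items():
        if v not in shifts:
            continue
        explained = any(
            u in shifts and shifts[u] * s == shifts[v]
            for u, s in influences
        )
        if not explained:
            bad.append(v)
    return bad

# b is up-regulated; its inhibitor c is down, which explains the shift.
edges = [("a", "b", +1), ("c", "b", -1)]
print(inconsistent_nodes(edges, {"a": -1, "c": -1, "b": +1}))  # -> []
```

On real-size networks (thousands of genes), expressing the same rule declaratively and searching for minimal corrections is precisely where Answer Set Programming solvers pay off.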
G. Kerbellec has elaborated a grammatical inference method for modelling functional or structural families of proteins. Given a sample of sequences belonging to the family (learning from positive instances only), the method relies on a characterization step, which can be formalized as a search under constraints for the best partial local multiple alignment (PLMA) of the sequences, and a generalization step, in which the PLMA is used to infer non-deterministic finite-state automata according to the Minimum Description Length principle. It is, to date, the first operational grammatical inference method available to biologists.
The staff of the bioinformatics platform GenOuest has been greatly reinforced by two new research engineer positions. GenOuest is now labelled as a national RIO platform and as a platform of GIS IBiSA. It has offered many new tools this year that result directly from transfers of research prototypes, a very good achievement with respect to its mission to provide innovative software in bioinformatics. In particular, each of the theses described above has been extended into a software package accessible on the web server and connected to the main databanks. Domain organizer proposes a synthetic view of a set of related DNA sequences by providing both a segmentation into domains and a classification on the basis of these domains. Bioquali is a Python plug-in dedicated to the validation and prediction of qualitative data coupled with a biological network represented as an interaction graph. ProtomataLearner is a software package for the inference of finite-state automata on protein sequences. It provides an interesting tool for the discovery of new proteins from sequences, deciphering both the active positions in proteins of the studied family and its possible subclasses.
Studying life at the macromolecular level (DNA, RNA, proteins or metabolic molecules) involves a number of studies of different kinds:
Data collection. The main unsolved research issue is the reconstruction of a sequence from its fragments after sequencing and/or mass fingerprinting. Finishing an assembly remains a hard task. There is renewed interest in this area due to the multiple sources of data and to the rise of metagenomics (considering several genomes simultaneously).
Data and knowledge management. This is currently a major issue. Information is produced in a highly distributed way, in each laboratory. Normalization of data, structuring of data banks, detection of redundancies and inconsistencies, integration of several sources of data and knowledge, extraction of knowledge from texts: all of these are crucial tasks for bioinformatics.
Analysis of similarities/differences. Referring to a set of already known sequences is the most important method for studying new sequences, in the search for homologies. The basic issue is the alignment of a set of sequences, where one looks for a global correspondence between the positions of each sequence. A more complex issue consists in aligning structures. More macroscopic studies are also possible, involving more complex operations on genomes such as permutations. Once sequences have been compared, phylogenies, that is, trees tracing back the evolution of genes, may be built from a set of induced distances. A more recent track considers Single Nucleotide Polymorphism data, which correspond to mutations observed at given positions in a sequence within a population. Analysing this type of data and relating it to phenotypic data leads to new research issues.
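The global alignment problem mentioned above is classically solved by dynamic programming. The following minimal sketch computes the optimal alignment score of two sequences with the standard Needleman-Wunsch recurrence; the unit match/mismatch/gap scores are illustrative only.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch), O(len(a)*len(b)) time,
    keeping only the previous row of the dynamic-programming matrix."""
    prev = [j * gap for j in range(len(b) + 1)]  # align prefix of b to ""
    for i, ca in enumerate(a, 1):
        cur = [i * gap]                          # align prefix of a to ""
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag,                 # substitute / match
                           prev[j] + gap,        # gap in b
                           cur[j - 1] + gap))    # gap in a
        prev = cur
    return prev[-1]

print(nw_score("ACGT", "AGGT"))  # -> 2 (three matches, one mismatch)
```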
From structural analysis to systems biology. This large domain aims at extracting biological knowledge from X-ome studies, where X ranges from genes to metabolites. Biological sequences and networks of components in the cell must satisfy a number of important constraints with respect to stable and accessible dynamic conformations and functional mechanisms. These constraints result in the conservation during evolution of "patterns" and types of interactions to be deciphered. The main point becomes the study of life as a system, abstracted as a network of components governed by interaction laws.
Sequences are considered as words over an alphabet of nucleic or amino acids. The set of superimposed structural and functional constraints leads to the formation of a true language, knowledge of which would enable one to predict the properties of the sequences. The theory of languages formalizes the basic concepts underlying the studied phenomena (degree of expressivity, complexity of analysis, associated automata, algebra on languages). Still very few authors have explored this paradigm. It can be studied from two points of view:
A fundamental point of view, where the goal is to define and study the classes of formal languages best adapted to the description of observed natural phenomena. The splicing systems of Head, or H-systems, reproducing the phenomenon of crossing over, represent one of the most fertile formalisms in this respect. Language theorists like A. Salomaa and Gh. Paun have also explored standard questions (complexity, decidability, stable languages, etc.) when faced with natural operations on biological sequences (inversion, transposition, copy, deletion, etc.) and proposed in particular a model called sticker systems, based on the operation of complementarity as it occurs in Watson-Crick pairings. The current consensus is that the necessary expressivity is that of the class of "mildly context-sensitive" languages, well known in natural language analysis. The most complete work in this field seems to be due to D. Searls;
A more practical point of view, where the goal is to provide biologists with the means to formalize their model using a grammar which, submitted to a parser, makes it possible to extract from public data banks the sequences relevant to the model. J. Collado-Vides was one of the first to take an interest in this framework, for the study of gene regulation. D. Searls proposed a more systematic approach based on logical grammars and a parser, Genlang. Genlang is still rarely used in the community of biologists, probably because it requires advanced competence in languages. We started our own work from this solution, keeping in mind the need for better accessibility of the model to biologists.
In practice, biologists are often unable to provide sufficient models. Assisting them in building relevant models requires the development of machine learning techniques.
Because of its practical importance and the increasing quantity of available data, a number of pattern discovery methods have emerged in recent years. In particular, due to the massive production of expression data from DNA chips, many papers have been devoted to pattern discovery in promoter sequences. Several reviews of the field are available. The first criterion for classifying methods is the type and expressivity of the patterns they look for. One can primarily represent a language either within a probabilistic framework, by a distribution on the set of possible words, or within a formal-language framework, by a production system for the set of accepted words. At the frontier, one finds Hidden Markov Models and stochastic automata, which have very good performance, but where classically the structure is fixed and learning bears on the parameters of the distribution. Distributional representations are expressed in various modalities: consensus matrices (probability of occurrence of each letter at each position), profiles (taking gaps into account), weight matrices (quantity of information at each position and contribution of each letter). At the algorithmic level, alignments play a fundamental role. One scans for short words in the sequences, then alignments are carried out by dynamic programming around these "anchoring" points. The production of "blocks" is typical of this approach. The most powerful programs in this field are currently Bayesian procedures building a consensus matrix by Gibbs sampling, with organism-specific higher-order models (Markov chains) for prior frequency estimates, such as Motif Sampler, and Meta-Meme, which builds a Markov network combining such matrices produced by the EM (Expectation-Maximization) algorithm.
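The consensus/weight-matrix representation can be made concrete with a minimal sketch: build a log-odds position weight matrix from a few aligned sites and scan a sequence for its best-scoring window. The example sites, the uniform background and the simple pseudocounts are illustrative only; this is not the Gibbs-sampling procedure itself.

```python
import math

def pssm(sites, alphabet="ACGT", pseudo=1.0):
    """Log-odds position weight matrix from aligned binding sites,
    assuming a uniform background and additive pseudocounts."""
    w = len(sites[0])
    bg = 1.0 / len(alphabet)
    mat = []
    for j in range(w):
        col = [s[j] for s in sites]
        mat.append({a: math.log(((col.count(a) + pseudo) /
                                 (len(sites) + pseudo * len(alphabet))) / bg)
                    for a in alphabet})
    return mat

def best_hit(mat, seq):
    """Highest-scoring window of the sequence under the matrix."""
    w = len(mat)
    scores = [(sum(mat[j][seq[i + j]] for j in range(w)), i)
              for i in range(len(seq) - w + 1)]
    return max(scores)

# Made-up promoter-like sites (TATAAT-style consensus).
sites = ["TATAAT", "TATGAT", "TACAAT"]
score, pos = best_hit(pssm(sites), "GGCTATAATCG")
print(pos)  # -> 3, the offset of the TATAAT window
```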
The linguistic representation, which corresponds to our own work, generally rests on regular expressions. Algorithms use combinatorial enumeration in a partially ordered space. Among the most applied in this field is the Pratt program, using principles very close to those found in the work of M.-F. Sagot and A. Viari. Another track explores variations on the search for cliques in a graph.
Even if the results obtained so far are interesting in a number of cases, we think that there is a fundamental limitation to current studies: it is primarily the presence of some class of letters at a given position that drives the prediction. However, it is clear that relations exist between various sites, sometimes distant on the sequence, and play an important biological role. Purely statistical learning seems to have reached its limits here, because of the multiplication of parameters to be adjusted. The theoretical framework of formal languages, where one can instead seek to optimize the complexity of the representation (parsimony principle), seems to us better adapted. We are engaged in this research track.
Machine learning is a research field devoted to the design and analysis of algorithms for making predictions about the future based on past experience. Taking root in artificial intelligence and statistics, it focuses on learning algorithms inspired both by a cognitive view of natural learning from experience and by statistical techniques for fitting model parameters to data. Research is carried out from a theoretical point of view (computational learning theory), studying learnability criteria and learnable classes, and from a more practical point of view (applied machine learning), focusing on the algorithms and their performance. Recent techniques mix both points of view, for example boosting techniques (achieving good performance from an initially weak learner) or support vector machines (applying the structural risk minimization principle from statistical learning theory). Integrating statistical tools is a growing trend: one can cite reinforcement learning, classification or statistical physics, as well as research on neural networks or hidden Markov models (HMMs). The problem of comparing and integrating these symbolic and numerical approaches has been extensively studied. Hidden Markov models are ubiquitous in bioinformatics. They combine the mathematical structure of a (hidden) Markov chain with, for each state, a distinct independent and identically distributed (IID) or stationary random process. Estimation of the parameters following maximum likelihood or related principles has been extensively studied, and good algorithms relying on dynamic programming techniques are now available. When available, domain knowledge may help to design a structure empirically but, in practice, the structure is often very simple (e.g. Profile HMMs) and the discriminative power of an HMM relies essentially on its parameter choice.
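The dynamic-programming decoding underlying HMM usage can be recalled with a minimal log-space Viterbi sketch. The two-state GC-content model and all its parameters below are made up for illustration; they are not taken from any tool mentioned in this report.

```python
from math import log

def viterbi(obs, states, start, trans, emit):
    """Most probable hidden-state path of a discrete HMM (log-space)."""
    V = [{s: log(start[s]) + log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor for state s at this step.
            p = max(states, key=lambda q: V[-1][q] + log(trans[q][s]))
            col[s] = V[-1][p] + log(trans[p][s]) + log(emit[s][o])
            ptr[s] = p
        V.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):   # backtrack through the pointers
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy 2-state model: "L" favours A/T, "H" favours G/C, sticky transitions.
states = ("L", "H")
start = {"L": 0.5, "H": 0.5}
trans = {"L": {"L": 0.9, "H": 0.1}, "H": {"L": 0.1, "H": 0.9}}
emit = {"L": {"A": .4, "T": .4, "C": .1, "G": .1},
        "H": {"A": .1, "T": .1, "C": .4, "G": .4}}
print(viterbi("ATATGCGCGC", states, start, trans, emit))
# -> ['L', 'L', 'L', 'L', 'H', 'H', 'H', 'H', 'H', 'H']
```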
In the Symbiose project, we study this problem in the more general framework of grammatical inference. Grammatical inference, variously referred to as automata induction, grammar induction, or automatic language acquisition, is the process of learning grammars and languages from sequences. Let us note that the emphasis is not only on learning languages (i.e. sets of sequences) but also on learning grammars (i.e. structural representations of the sequences of the language).
A grammatical inference problem is an optimization problem involving the choice of: a) a relevant alphabet and a class of languages; b) a class of representations for the languages and a definition of the hypothesis space; c) a search algorithm using the properties of the hypothesis space and the available bias (knowledge) about the domain to find the "best" solution in the search space.
The state of the art in grammatical inference is mostly about learning the class of regular languages (at the same level of complexity as HMM structures), for which positive theoretical results and practical algorithms have been obtained. Some results have also been obtained on (sub-)classes of context-free languages. In the Symbiose project, we study more specifically how grammatical inference algorithms may be applied to bioinformatics, focusing on how to introduce biological bias and how to obtain explicit representations.
Our main focus is on the inference of automata from samples of (unaligned) sequences belonging to a structural or functional family of proteins. Automata are graphical models representing a (potentially infinite) set of sequences. They can be used to gain new insights into the family when classical multiple sequence alignments are insufficient, or to search for new family members in the sequence data banks, with the advantage of a finer level of expressivity than classical sequence patterns (such as PSSMs, Profile HMMs, or Prosite patterns), making it possible to model heterogeneous sequence families.
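To make the automaton representation concrete, here is a toy sketch of nondeterministic finite-state automaton matching in plain Python. The four-state automaton and its tiny C-x-C-like motif are made up for illustration; real inferred models are of course much richer.

```python
def nfa_accepts(transitions, start, accept, seq):
    """Nondeterministic finite-state automaton acceptance test.
    transitions: dict mapping (state, letter) -> set of next states."""
    current = {start}
    for letter in seq:
        current = set().union(*(transitions.get((q, letter), set())
                                for q in current))
        if not current:          # no live state left: reject early
            return False
    return bool(current & accept)

# Hypothetical automaton for a tiny C-x-C-like motif,
# where the middle position admits A or G.
trans = {(0, "C"): {1}, (1, "A"): {2}, (1, "G"): {2}, (2, "C"): {3}}
print(nfa_accepts(trans, 0, {3}, "CAC"))  # True
print(nfa_accepts(trans, 0, {3}, "CTC"))  # False
```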
Once genes and proteins have been identified, it becomes possible to turn to integrative biology: how is genetic information used so that a given organism is able to develop and survive? Differences in a single gene may explain some simple (or Mendelian) characters, such as monogenic diseases or color phenotypes. However, a major part of phenotypic characters derives from the combined action of many genes. These interactions lead to complex genetic models of phenotypic characters, especially if one takes into account the influence of the environment on the character.
Networks are natural models of gene interactions: they are abstract enough to be formalized while still representing the complexity of a biological organism. In this framework, dynamics is essential: an organism cannot be understood without considering its development; similarly, the functions of a network cannot be separated from its dynamics.
Technically, this global point of view is motivated by the emergence of new high-throughput techniques (DNA chips for gene activity, ChIP-on-chip for DNA/protein interactions, mass spectrometry for protein interactions) allowing massive observations at the whole-cell or tissue level.
The field of systems biology appeared in response to increasing needs for analytical approaches in molecular biology. Its goals include modelling interactions, understanding the behaviour of a system from the interplay of its components, confronting the predictions of a model with data, and inferring models from data. Solutions to these challenges are often interdisciplinary.
The modelling of cellular interactions is an old domain in biology, initiated by people interested in the dynamics of enzyme systems. Models of genetic networks appeared as soon as gene interactions were discovered. The simplest static model consists of a directed labeled graph, with label + for activation or - for inhibition. Such graph representations are used to store known interactions in general databases. They are also the framework of Bayesian representations used to infer gene networks from microarray data. However, this technique appears to be incomplete without the support of information from the literature.
The dynamical framework includes simulation and prediction of behaviours; models can be either qualitative or quantitative, as reviewed in the literature. A first approach makes use of continuous models: the concentrations of products are modeled by continuous functions of time governed by differential equations. This framework allows one to state biological properties of networks, possibly with the help of simulation software. The properties of continuous models can be studied with convex analysis and linear and non-linear control techniques. Stochastic models transform reaction rates into probabilities and concentrations into numbers of molecules, making it possible to understand how noise influences a system. Finally, in discrete models each component is assumed to have a small number of qualitative states, and the regulatory interactions are described by discrete functions. Relevant discrete frameworks can be Boolean, logical, or Petri nets. Piecewise-linear differential models try to build a bridge between continuous and discrete models.
Each of these methods addresses in complementary ways dynamical properties such as the existence of attractors (limit cycles or steady states) and their behavior with respect to changes in the parameters. They are powerful tools for acquiring fine-grained knowledge of the system at hand, but they need accurate data on chemical reaction kinetics or qualitative information, and such data are rarely available. Furthermore, these methods are computationally demanding, and their use is restricted in practice to a small number of variables.
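The discrete side of this spectrum can be illustrated with a minimal sketch: enumerating the attractors of a small Boolean network under synchronous update by exhaustive simulation. The two-gene mutual-inhibition switch below is a made-up textbook-style example, not one of the cited formalisms' tools; real networks require symbolic methods rather than enumeration of all 2^n states.

```python
from itertools import product

def attractors(update, n):
    """Attractors of a synchronous Boolean network on n components.
    update: function mapping a state tuple to the next state tuple."""
    found = set()
    for state in product((0, 1), repeat=n):
        trajectory, visited = [], {}
        while state not in visited:          # iterate until a state repeats
            visited[state] = len(trajectory)
            trajectory.append(state)
            state = update(state)
        cycle = tuple(sorted(trajectory[visited[state]:]))
        found.add(cycle)                     # canonical form of the cycle
    return found

# Toy 2-gene mutual-inhibition switch: each gene is on iff the other is off.
toggle = lambda s: (1 - s[1], 1 - s[0])
print(sorted(attractors(toggle, 2)))
# -> [((0, 0), (1, 1)), ((0, 1),), ((1, 0),)]
```

The two fixed points (0, 1) and (1, 0) are the expected bistable states of such a switch; the 2-cycle is an artefact of the synchronous update scheme.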
Model identification addresses a different objective, namely building or updating a model consistent with a set of data. When large amounts of data are available, models can be built using statistical techniques such as Bayesian networks or kernels. Another efficient approach formalizes a priori knowledge as partially specified models. Fitting models to data is achieved by means of various techniques depending on the class of models, which can be discrete, continuous, or hybrid. Qualitative reasoning, hybrid systems, constraint programming or model checking make it possible either to identify a subset of active processes explaining experimental time-series data, or to correct the models and infer some parameters from data. Identification methods are limited to a few dozen components. Model correction or parameter regression can cope with up to hundreds of products provided that the biomolecular mechanisms and supplied kinetic data are accurate enough.
Qualitative data such as DNA microarray data cannot easily be used in most of the frameworks described above, for two main reasons. First, the model-based identification approach has difficulty taking into account the errors and variability that commonly affect measured expression levels in DNA microarrays. Second, time-series data are absent in many situations (for instance disease studies on clinical tissues), so the available data inform more on steady-state shifts under perturbation than on the dynamics of the system.
The philosophy of our project is to develop techniques around network modelling, using models adapted to the kind of observations available with the biological techniques at hand. The methods we develop have two characteristics:
Our models integrate simultaneously a biochemical (metabolic or signalling) component and a genetic component. Indeed, in pluricellular organisms, genetic actors are activated in the framework of complex metabolic or signaling pathways that have their own dynamics. It is our goal to better understand the relations between these two components.
We follow a qualitative modelling approach, using either discrete event networks or qualitative equations derived from differential models.
Mixing parallel computing and genomics is motivated both by the large volume of data to handle and by the complexity of certain algorithms. Today (October 2007), about 650 genomes, including the human genome, are completely sequenced, and there exist more than 3000 other sequencing projects (see the Genomes OnLine Database).
The problem is to explore these banks efficiently and extract relevant information. A routine activity is to perform content-based searches for unknown DNA or protein sequences: the goal is to detect similar objects in the banks. The basic assumption is that two sequences sharing similarities (identical characters) may have related functionality. Even if this assumption does not always hold, it can give precious clues for further investigation.
The first algorithms for comparing genomic sequences were developed in the seventies. They were essentially based on dynamic programming techniques. Then, with the increasing growth of data, faster algorithms were designed to drastically speed up the search. The Blast software now acts as a reference for performing rapid searches over large databases. But, in spite of its short computation time (compared to the first algorithms), a growing number of genomic research applications require much lower computation times. Parallelizing the search over large parallel computers is a first solution. The LASSAP software, developed by J.-J. Codani at Inria, has been designed in that direction: it parallelizes a standard suite of bioinformatics tools dedicated to intensive genomic computations.
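The speed-up behind heuristics such as Blast comes from their first, filtering step: index the k-mers of one sequence and retrieve the seeds shared with the query, which are then extended into full alignments. A minimal sketch of that seeding step (illustrative only, not Blast's actual implementation):

```python
def seed_hits(query, target, k=3):
    """Shared k-mer seeds between query and target: the filtering step
    behind BLAST-like heuristics. Returns (query_pos, target_pos) pairs."""
    index = {}
    for i in range(len(target) - k + 1):          # index all target k-mers
        index.setdefault(target[i:i + k], []).append(i)
    hits = []
    for j in range(len(query) - k + 1):           # look up query k-mers
        for i in index.get(query[j:j + k], []):
            hits.append((j, i))
    return hits

print(seed_hits("ACGTAC", "TTACGTT", k=4))  # -> [(0, 2)]
```

Only the regions around such seeds are then aligned by dynamic programming, which is what turns a quadratic scan of the whole bank into a fast indexed search.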
Other research directions have also been investigated to speed up searches in large genomic banks, in particular dedicated hardware machines. Several research prototypes, such as SAMBA, BISP, HSCAN or BioScan, have been proposed, leading today to powerful commercial products: BioXL, DECYPHER and GeneMatcher, from Compugen Ltd., TimeLogic and Paracel respectively.
Beyond the standard search process, this huge volume of freely available data naturally promotes new fields of investigation requiring much more computing power, such as comparing a set of complete genomes, classifying all the known proteins (Décrypthon project), establishing specific databases (ProDom), etc. Of course, the solutions discussed above can still be used, even if, over the last 3-4 years, a new alternative has appeared with grid technology. Here, a single treatment is distributed over a group of computers geographically scattered and connected through the Internet. Today, a few grid projects focusing on genomics applications are under deployment: the bioinformatics working group (WP 10) of the European DataGRID project; the BioGRID subproject of the EuroGRID project; the GenoGRID project, deploying an experimental grid for genomics applications; and the GriPPS (Grid Protein Pattern Scanning) project.
Note that the large amount of genomic data is not the only motivation for parallelizing computations. The complexity of certain algorithms is another strong motivation, especially for the analysis of structures in sequences [BMW03]. For instance, predicting the 3D structure of a protein from its amino acid sequence is an extremely difficult challenge, both in terms of modelling and of computation time. The problem is investigated in many ways, ranging from de novo folding prediction to protein threading techniques. The underlying problems are NP-complete and require both combinatorial optimization and parallelization approaches to compute a solution in a reasonable amount of time.
Over the last 2-3 years, the computational power of GPUs (Graphics Processing Units) has increased dramatically. They have now become a real alternative for offloading very time-consuming general-purpose computations. This activity is referred to as GPGPU, standing for General-Purpose computation on GPUs. Many bioinformatics algorithms present features that allow efficient parallelization. In 2007, we started investigating the potential of this hardware support on several basic bioinformatics algorithms.
The main aims of bioinformatics are to assist biologists in discovering prognostic, diagnostic and therapeutic targets and in understanding biological mechanisms. In practice, this covers a great variety of work.
The local context of Ouest-genopole provides us with many collaborations with biology laboratories. We emphasize here three types of applications with major achievements in the project.
Targeted gene discovery is studied with a syntactical approach. Models are built for proteins or promoters and then searched for in whole genomes. With such a strategy we have, for instance, been able to discover new beta-defensins, a family of anti-microbial peptides, in the human genome.
Whole-genome analysis is made practical through dedicated data structures and reconfigurable architectures. We have, for instance, performed Blast comparisons on the human genome in 1 minute, built a software tool for bacterial genome fragmentation, GenoFrag, that helps to study genome variations via long-range PCR, and studied the occurrence of retrotransposons, a family of mobile genomic units, in the genome of Arabidopsis thaliana.
Genomic/metabolic interaction networks are modeled in eukaryotic organisms. We consider large-scale models described by their interaction graphs. As an application, we are studying genes and metabolites involved in lipogenesis (chickens) and in TGF-beta regulation in association with hepatocellular carcinomas (human). We test our analysis algorithms on the transcriptional regulation networks of E. coli and S. cerevisiae.
The software section of Symbiose needs a short introduction for a better understanding of the specific organization we have set up in order to foster scientific relations between biologists and computer scientists. Most of the prototypes built during our research are transferred to the bioinformatics platform GenOuest for further development and integration in an environment suitable for biologists. However, GenOuest has its own activity related to the service it offers, and it also shares studies with INRA bioinformatics platforms. This section contains three parts following this organization scheme:
the first subsection introduces the context of GenOuest and general elements of its activity;
the second subsection presents the main results of the platform for this year, on its own projects and in collaboration with INRA;
the last subsection presents the yearly transfer activity of the platform, in conjunction with Symbiose.
The bioinformatics platform GenOuest acts as a facility and software tool provider for the analysis of genomic data generated by the numerous laboratories (55) of OUEST-genopole®. The platform provides first of all computing power, but also a comprehensive list of software dedicated to sequence analysis. At the national level, the platform is developing expertise in the field of pattern matching and pattern discovery tools. This activity results from a tight interaction with the Symbiose research team. The platform first received a national RIO label in 2003; this label was renewed in 2006. The platform is supported by several contracts: CNRG 2005 and 2006, Région Bretagne 2006 and 2007.
One of the main characteristics of the platform is a strong interaction with biology labs. This leads to a fruitful partnership where new tools can originate from direct requests to the platform. Fostering this partnership requires strong involvement of the GenOuest engineers, who participate in numerous meetings with end users. The platform plays an important role in the technological transfer of new tools elaborated by the three bioinformatics research teams of OUEST-genopole® (Symbiose; LINA, Nantes; LERIA, Angers). These tools are transferred to the platform, and online access via the web server is provided to the whole scientific community.
The main developments of the year have been:
The GenOuest platform began implementing a quality management system (QMS) ten months ago. The initial goal is to obtain ISO 9001 certification during the first semester of 2008.
The GenOuest platform has offered training sessions attended by a broad audience, ranging from beginners to experienced users of bioinformatics tools. The subjects of the training sessions were sequence analysis, motif discovery and phylogenetics. More than 120 people have attended the training sessions.
Since its creation, the platform has organized an annual meeting including technical talks on the platform's achievements, but also invited speakers who give the opportunity to discover other organizations (other bioinformatics platforms), new technologies (software), or new scientific advances in bioinformatics.
The platform is involved in several coordination activities at the national level. H. Leroy has been a member of the ReNaBi (Réseau National des plates-formes Bio-informatiques) steering board since 2005. The GenOuest platform is in charge of the BioMAJ project, a joint project with INRA Jouy and INRA Toulouse.
The GenOuest bioinformatics platform hosts bioinformatics services developed by research teams that cannot provide the required computing power themselves. Among these services are:
Autograph, an interactive web server for automating and visualizing comparative genome maps
Germonline, a cross-species knowledge base providing microarray data relevant for the cell cycle and gametogenesis
MIPDB, a relational database of All Major Intrinsic Proteins of Archaea, Bacteria and Eukaryota.
M@ia, a tool dedicated to micro-array data analysis.
Genomic databases, including complete genomes such as the human genome, have been set up in an effort to help biologists in their research. Most of these databases are publicly available for consultation. Between two major releases, minor updates and corrections are also downloaded and installed in order to keep the databases up to date. Databases and tools are accessible on the web.
A bioinformatics web platform gives access to specific services for a community of bioinformaticians and biologists. We need to reduce the repetitive, time-consuming test tasks required to guarantee the quality of the services provided by the bioinformatics portal.
Based on existing testing frameworks, we have developed a prototype of an automated system and functional test engine with various functionalities (service availability, scheduling, evaluation, load testing...).
The toolbox
The accurate determination of primers is of key importance for biologists carrying out experiments. The platform has set up a set of tools dedicated to this kind of analysis:
GenoFrag. The goal of GenoFrag is to deal with Whole Genome PCR Scanning (WGPS), a means of analysing bacterial genome plasticity. This software is developed for the design of optimized primers for long-range PCR on whole genomes. GenoFrag first seeks all the potential primers on a chromosome, then calculates the best distribution of the primer pairs thanks to combinatorial optimization algorithms. It was tested on Staphylococcus aureus strains but can be used for other bacterial or viral species. A graphical interface is available on the Ouest-genopole bioinformatics platform server.
CAPS Tags (Cleaved Amplified Polymorphic Sequence). This tool virtually digests two sequences with the EMBOSS restrict program, aligns them with Multalign and displays single-enzyme cuts, taking into account the gaps introduced by the alignment. Differences are validated against the alignment; in this case a difference is a potential SNP.
Degenerate primers. A way to look for new genes is to use degenerate primers. The input data are a set of protein sequences from different species with the same biological function. They are aligned with Multalign, and the consensus of the longest fragments with few ambiguous amino acids is degenerated from the 3' end. We have developed a module, working with codon usage tables, that reverse-translates protein sequences into nucleic sequences with a bounded degeneration cost.
Microsatellite primers. Microsatellites are short repeated sequences that serve as markers in genome mapping. We use Sputnik to find microsatellites of a chosen length in the sequences, then try to design PCR primers in the sequences containing a microsatellite with primer3.
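The notion of a bounded degeneration cost used for degenerate primers can be illustrated with a minimal sketch (this is not the platform's module; the amino-acid peptides, the default bound of 512 and the function names are illustrative assumptions): the degeneracy of a reverse-translated fragment is the product, over its residues, of the number of codons encoding each amino acid, and fragments exceeding the bound are rejected.

```python
# Toy sketch (not the platform's code): estimate the degeneracy of a
# degenerate primer obtained by reverse-translating a short peptide.
# Degeneracy = product over residues of codons-per-amino-acid; primers
# above a chosen degeneracy bound are rejected.

CODON_COUNTS = {  # standard genetic code: number of codons per amino acid
    'A': 4, 'R': 6, 'N': 2, 'D': 2, 'C': 2, 'Q': 2, 'E': 2, 'G': 4,
    'H': 2, 'I': 3, 'L': 6, 'K': 2, 'M': 1, 'F': 2, 'P': 4, 'S': 6,
    'T': 4, 'W': 1, 'Y': 2, 'V': 4,
}

def degeneracy(peptide):
    """Number of distinct nucleic sequences encoding the peptide."""
    d = 1
    for aa in peptide:
        d *= CODON_COUNTS[aa]
    return d

def acceptable(peptide, bound=512):
    """Keep only fragments whose reverse translation stays under the bound."""
    return degeneracy(peptide) <= bound

print(degeneracy("MWT"))    # 1 * 1 * 4 = 4
print(acceptable("MWTLR"))  # 4 * 6 * 6 = 144 <= 512, so True
```

Fragments rich in M and W (a single codon each) thus make the best anchors, which is why the consensus fragments with few ambiguous amino acids are preferred.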
This activity is one strategic field in which the platform is trying to develop its specificity at an international level. Our goal is to be recognized as an expert resource center in this bioinformatics domain. In particular, three algorithms have been developed in-house by the Symbiose research team. STAN is based on a suffix tree data structure. This tool scans complete genomes or user-submitted sequences; the patterns are represented in the form of a grammar. WAPAM is a tool that searches data banks (nucleic acid or protein), complete genomes and personal sequences for protein patterns expressed as weighted automata. The upgraded web interface of WAPAM allows pattern searches to be executed on GenOuest servers. In both cases, the input patterns can be more complex than the usual regular patterns, such as PROSITE ones, since errors (substitutions and indels) and gaps of any size can be defined. In addition, STAN provides string variables. An additional tool, Protomata-Scan, has been available on the platform since September 2007. It is based on an extension of WAPAM named WASCAN, which is able to handle the full expressiveness of automata. This program scans data banks with respect to automata generated by the Protomata-Learner program (see Section ).
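The idea of scanning with a weighted automaton can be sketched as follows (this is a toy illustration, not WAPAM: the automaton, scores and threshold are invented for the example): transitions carry scores, and a match is reported at each start position from which some path reaches a final state with a total score above a threshold.

```python
# Minimal sketch of weighted-automaton scanning. The toy automaton below
# (hypothetical, for illustration only) recognizes "GA" followed by A or
# G, with A scoring higher than G.

TRANSITIONS = {  # (state, letter) -> list of (next_state, score)
    (0, 'G'): [(1, 1.0)],
    (1, 'A'): [(2, 1.0)],
    (2, 'A'): [(3, 2.0)],
    (2, 'G'): [(3, 0.5)],
}
FINAL = {3}

def scan(sequence, threshold=3.0):
    """Return (start position, score) for every match above the threshold."""
    hits = []
    for start in range(len(sequence)):
        agenda = [(start, 0, 0.0)]      # (position, state, score so far)
        while agenda:
            pos, state, score = agenda.pop()
            if state in FINAL and score >= threshold:
                hits.append((start, score))
                break                   # one hit per start position
            if pos < len(sequence):
                for nxt, w in TRANSITIONS.get((state, sequence[pos]), []):
                    agenda.append((pos + 1, nxt, score + w))
    return hits

print(scan("TTGAAGAG"))  # GAA at position 2 scores 4.0; GAG at 5 scores only 2.5
```

Substitutions and indels, as handled by WAPAM, would simply be additional (penalized) transitions in the same framework.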
The pattern discovery section of the web site proposes several pattern discovery algorithms
Biological knowledge in the proteomics and genomics context is mainly based on transitive bioinformatics analysis, consisting in the periodic comparison of newly produced data against a corpus of known information. This approach requires, on the one hand, accurate bioinformatics software, pipelines and interfaces, and on the other hand numerous heterogeneous biological data banks, which are distributed around the world.
A data integration process is obviously an essential preliminary step. It represents a major challenge and bottleneck in bioinformatics. These biological data banks contain a mass of heterogeneous data (all in different formats, several terabytes). After retrieval, the banks must undergo various, more or less customized, post-treatments before they can be used with bioinformatics software (blast, SRS, emboss, gcg, ...). The update frequency of the banks varies with the source, from daily to several times per year. Moreover, the nature and the number of banks are in constant evolution, and the data between sources are cross-linked. The maintenance task is therefore complex and heavy. A first challenge is to automate the data bank update process for the administrator. Another significant challenge concerns quality of service: providing users with a clear view of the integrity of the data (state, exact origin, ...) constituting their workspaces.
BioMAJ is a joint development between three bioinformatics platforms: INRA Toulouse (David Allouche), INRA Jouy-en-Josas (Christophe Caron) and our platform. BioMAJ is written using state-of-the-art technologies (Java, XML, ...) and is based on a parameterizable workflow engine. Post-processes are written for the usual formats (gcg, blast, srs, ...) and are easily customizable to users' needs. BioMAJ was released under an open-source licence in September 2007. The software can be downloaded from the BioMAJ website.
In collaboration with the Institute of Cancer Research in Sutton, UK, we have developed a software tool named GEO Exporter which allows validation, extraction and automatic publication of microarray data stored in BASE (BioArray Software Environment), a widely used open-source microarray data management system, to Gene Expression Omnibus.
GEO Exporter features can be divided according to the submission process steps: data selection, local data validation, generation of an exported data file (SOFT), remote validation at GEO, remote submission and submission management.
The program is freely distributed under GPL Licence. The software source, documentation, install instructions and demo are available at
This software has been used to publish human microarray experiment results .
We propose an approach to unsupervised hierarchical classification based on LLA hierarchical clustering principles. A dedicated software package, CHAVLh (Classification Hiérarchique par Analyse de la Vraisemblance des Liens en cas de variables hétérogènes), has been developed and registered at the APP agency (2005, I-C. Lerman, P. Peter (LINA), H. Leredde (Paris Nord)).
CHAVLh can handle heterogeneous descriptions involving several variable types. Mathematically, the descriptive variables are interpreted in terms of relations on the object set. This software is distinguished on several points: 1. It deals with descriptive variables of any type (numerical, categorical of any sort). 2. In the LLA method, probabilistic similarities, called likelihood of the link similarities, are established in order to mutually compare descriptive variables, described objects, variable classes or object classes. 3. In the classification tree built with the hierarchical clustering algorithm, the most "significant" levels - giving relevant partitions - and the most "significant" nodes - giving relevant aggregations - are detected.
This year, an ergonomic version called LLAhclust has been implemented in the R environment and is now available on the platform.
Protomata Learner is a program for the inference of automata from samples of (unaligned) sequences belonging to a structural or functional family of proteins. Inferred automata can be used to get new insights into the family, when classical multiple sequence alignments are insufficient, or to search for new family members in sequence data banks, with the advantage of a finer level of expressivity than classical sequence patterns (such as PSSMs, Profile HMMs, or Prosite patterns). Its main interest is to allow the representation of heterogeneous sequence families.
To learn such expressive models, Protomata Learner is based on the PhD work of G. Kerbellec and uses a sequence driven strategy based on the detection of strong local conservation and state
merging generalization under preservation constraints
Protomata Learner is written in Python and C, the web interface has been implemented with Python, PHP and JavaScript.
Discovering the underlying organization of any genome or sequence is of high importance in order to improve our understanding of living organisms.
Domain Organizer is a new tool dedicated to the analysis of repeated elements along biological sequences, resulting from the PhD work of S. Tempel. It relies on the association of language analysis, optimization and classification tools that allow the identification, characterization and graphical representation of the combinations of elementary domains that make up each sequence of a given family. This tool is available to the community on the web site.
Pyquali is a Python module dedicated to computations on qualitative models represented by interaction graphs. Nodes of these graphs represent chemical species, and arrows are labeled by {+, -}, representing the influence of the variation of one species on another. An efficient representation of qualitative equations as ternary decision diagrams has been developed in this module. The Python interface allows user-friendly manipulation of the equations. The module is available on the GenOuest website.
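The sign algebra underlying such qualitative models can be sketched in a few lines (this is an illustration of the assumed semantics, not the Pyquali module itself): variations and edge labels take values in {+, -}, and a qualitative sum of influences is indeterminate ('?') when opposite signs combine.

```python
# Sketch of qualitative sign computations on an interaction graph
# (assumed semantics for illustration; not the actual Pyquali API).

def times(a, b):
    """Qualitative product of two signs."""
    if '?' in (a, b):
        return '?'
    return '+' if a == b else '-'

def plus(signs):
    """Qualitative sum: +, -, or '?' when signs conflict."""
    s = set(signs)
    if s == {'+'}:
        return '+'
    if s == {'-'}:
        return '-'
    return '?'

def influence(node, edges, variation):
    """Qualitative net influence on `node`, given labeled edges
    (source, target, sign) and observed variations of the sources."""
    terms = [times(sign, variation[src])
             for src, tgt, sign in edges if tgt == node]
    return plus(terms)

edges = [('a', 'c', '+'), ('b', 'c', '-')]
print(influence('c', edges, {'a': '+', 'b': '+'}))  # '+' and '-' combine to '?'
print(influence('c', edges, {'a': '+', 'b': '-'}))  # both '+' give '+'
```

Ternary decision diagrams then give a compact representation of the solution sets of systems of such equations.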
Two lines of research are carried out within the framework of linguistic analysis of sequences. The first aims at the expression of complex models on sequences. This will help biologists both to validate their models against whole genomes or sets of sequences and to find new candidates in sequence data banks. The second, complementary line of research aims at the automatic inference of such models from a set of sequences sharing a functional or structural property. Beyond finding new candidates, the goal is then to assist biologists in gaining new knowledge about the sequence family through the discovery of explicit models highlighting the important parts of the sequences and their organization.
J. Nicolas coordinates the national ANR project Modulome, which aims at modelling the structure of genomes in terms of assemblies of «modules» that may be copied and moved inside or between genomes. This is supported by three applications on genomic mobile elements, in cooperation with URGI/Inra Versailles, LME/Ifremer Brest and LEPG/CNRS Tours. This year we have worked on the characterization of two types of modules. The corresponding software has been transferred to the bioinformatics platform.
Dynamics of transposable element Helitron.
The first study concerned the analysis of the dynamics of the transposable element Helitron, a recently discovered type of transposon widely spread in the genome of Arabidopsis thaliana. This work concluded with the defense of S. Tempel's PhD . The thesis studies three relations between helitrons and their host genome: their mode of invasion, the modularity of their internal sequence and their impact on nearby genes. We have produced a formal grammatical model of these elements, made of two extremities separated by a highly variable sequence of bounded size, which led to the proposition of a new nomenclature of helitrons in Arabidopsis based on their extremities instead of their global sequence . A method and a software tool, DomainOrganizer, allow the domain composition to be established. It first detects domain borders from a multiple alignment and provides a list of putative domains. From this list it searches for an optimal coding of the sequences in the family as concatenations of sequences extracted from the list . We formulate the coding of the family as a (discrete) optimization problem, and express it as a 0/1 linear program. Sequences are then clustered and visualized with respect to their domains.
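The coding step can be illustrated on a single sequence. The actual method solves a 0/1 linear program over the whole family; the sketch below solves only the simpler per-sequence subproblem by dynamic programming (an assumed simplification, with toy domains), finding a minimal concatenation of putative domains covering the sequence.

```python
# Illustrative per-sequence version of the domain coding problem
# (the real DomainOrganizer optimizes over the whole family with a
# 0/1 linear program; this DP sketch is a simplification).

def best_coding(sequence, domains):
    """Return a minimal list of domains whose concatenation equals
    `sequence`, or None if no such coding exists."""
    n = len(sequence)
    best = [None] * (n + 1)   # best[i]: minimal coding of sequence[:i]
    best[0] = []
    for i in range(1, n + 1):
        for d in domains:
            j = i - len(d)
            if j >= 0 and best[j] is not None and sequence[j:i] == d:
                if best[i] is None or len(best[j]) + 1 < len(best[i]):
                    best[i] = best[j] + [d]
    return best[n]

domains = ["AT", "ATA", "TACG", "CG"]
print(best_coding("ATATACG", domains))   # ['ATA', 'TACG']
```

Clustering the sequences of a family by the domain lists produced this way then reveals shared internal organizations.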
Analysis of CRISPR.
The second study concerned the analysis of an intriguing genetic structure in bacteria and archaebacteria, CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats). These structures are formed by a repetitive skeleton including genetic material that seems to be imported from viruses and plasmids. CRISPR are likely to represent a fundamental immune system in prokaryotes. However, a systematic study of this family was in need of relevant models. We have formalized an important and general notion in this respect, locality, which delimits with a single parameter the distribution of occurrences of repeated units in a sequence. We have also improved our genome sequence visualization method, Pygram, which is now able to display the hierarchical organization of repeated structures on multiple genomic sequences and to synchronize the search across these sequences.
Fractal structures inside DNA.
The Chaos Game representation has been used by several laboratories to get a global view of repeats in genomes. Each DNA sequence generates striking pictures with different non-random fractal structures. Fundamental questions are "where do these structures come from?" and "how can we characterize them?". We have shown that the observed structures come from the presence of regular languages in DNA sequences. We have also proposed a method to quantify this presence and to characterize these structures with probabilistic automata. We have shown that these structures are in fact sufficient to discriminate genomes . Finally, a deeper understanding of the relation between the image level and the language level led us to introduce variants on different problems.
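The Chaos Game representation itself is simple to reproduce: each nucleotide is assigned a corner of the unit square, and every base moves the current point halfway towards its corner, so that regularities in the sequence show up as dense fractal regions in the resulting point cloud. A minimal sketch:

```python
# Chaos Game Representation of a DNA sequence: iterate midpoints
# towards the corner associated with each base.

CORNERS = {'A': (0.0, 0.0), 'C': (0.0, 1.0), 'G': (1.0, 1.0), 'T': (1.0, 0.0)}

def cgr_points(sequence):
    """Return the CGR point obtained after each base of the sequence."""
    x, y = 0.5, 0.5                      # start at the centre of the square
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

pts = cgr_points("ACGT")
print(pts[0])   # (0.25, 0.25): halfway from the centre to corner A
```

A forbidden word in the sequence (a word of the complement of the underlying regular language) leaves a corresponding sub-square of the picture empty, which is the link between the image level and the language level mentioned above.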
Beyond regular patterns.
An ambitious platform to search for complex models within both DNA and protein sequences is under development. It is based on previous work in the team proposing an expressive language (Stan and Wapam) that goes beyond pattern matching in biological sequences, and on a study of the modelling needs of biologists at the level of whole genomes. The project's concerns are expressiveness, efficiency and ergonomics.
Wapam has been extended to handle the full expressiveness of automata. This new version, named Wascan, will be accessible on the GenOuest bioinformatics platform. It can already be used through the Protomata-Scan interface on the GenOuest platform to scan databases with automata produced by Protomata-Learner .
Our main research focus this year was the specification of the modelling language. The language, called Logol, allows writing a particular form of Definite Clause Grammars, namely String Variable Grammars (SVG), in the line of Searls' work. We have deeply revisited SVG modelling, notably by including in our models the fact that observed sequences are variants that have evolved from a common ancestor. The difficulty of modelling and detecting such repeats is that the reference ancestor sequence may well not appear at all in the analysed sequence (e.g. AATT may be an observed string made of two copies, AA and TT, of the same ancestor TA). Formally, Logol is a dedicated constrained string language that considers two levels for chains: a level which designates abstract chains, and a level which designates concrete subsequences .
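The ancestor-copy idea on the AATT example can be captured by a tiny decision procedure (a toy illustration, not Logol): a string is the concatenation of two same-length variants of a common, possibly unobserved, ancestor with at most k substitutions per copy exactly when the Hamming distance between its two halves is at most 2k, since each mismatching position costs one error on at least one of the copies.

```python
# Toy check (not Logol) for "two variant copies of an unobserved
# ancestor", each copy carrying at most k substitutions.

def hamming(u, v):
    """Number of mismatching positions between equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def is_ancestral_square(s, k):
    """True if s = u + v with |u| = |v| and both u and v within
    k substitutions of some common ancestor string."""
    if len(s) % 2:
        return False
    half = len(s) // 2
    return hamming(s[:half], s[half:]) <= 2 * k

print(is_ancestral_square("AATT", 1))  # True: ancestor TA yields AA and TT
print(is_ancestral_square("AATT", 0))  # False: AA and TT are not identical
```

String variable grammars generalize this far beyond two fixed halves, with variables, gaps and indels, which is precisely what makes their matching problem hard.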
Two typical problems related to protein structure are fold prediction/recognition and fold comparison. By their nature, 3D computational problems are inherently more complex than the corresponding 1D ones, for which efficient solutions have already been developed. A theoretical basis that could rigorously support the understanding of models for structure prediction and analysis is almost non-existent, as the problems are a blend of continuous geometric and combinatorial discrete mathematics. We focus notably on creating efficient exact algorithms for solving integer programming problems of proven interest: sequence/structure alignment (the protein threading problem, PTP), and structure/structure comparison.
Sequence/structure alignment.
We have been working on the PTP since 2001 in close collaboration with two teams . The first is MIG at INRA, Jouy-en-Josas, and the second is the Operations Research Department of the University of Sofia, Bulgaria, whose leader, Prof. Nicola Yanev, regularly visits Symbiose. MIG created FROST, the only French software participating in the prestigious CAFASP competition (Critical Assessment of Fully Automated Structure Prediction), which is held every two years. The dedicated optimization algorithm in this tool is currently based on the Lagrangian approach developed by our group .
The weak point of FROST and other existing methods is that they permit only global alignment of a sequence with a structure, that is, without allowing omissions of blocks during the alignment process. Designing algorithms for flexible structure alignment is our current focus of interest. Some ideas to tackle this problem were presented in , based on MIP models for semi-global and local sequence/structure alignment.
Structure/structure comparison.
A multitude of measures have been proposed to quantify the similarity between protein 3D structures. Among these measures, contact map overlap (CMO) maximization has received sustained attention during the past decade because it offers a fine estimation of the natural homology relation between proteins. Despite this large involvement of the bioinformatics and computer science community, the performance of known algorithms remains modest. Due to the complexity of the problem, they stall on relatively small instances and are not applicable to large-scale comparison. Our results offer a clear improvement over past methods in this respect. We present a new integer programming model for CMO and propose an exact branch-and-bound algorithm with bounds computed by solving a Lagrangian relaxation. The efficiency of the approach is demonstrated on a popular small benchmark (the Skolnick set, 40 domains). On this set our algorithm significantly outperforms the best existing exact algorithms, while providing lower and upper bounds of better quality. Some hard CMO instances have been solved for the first time and within reasonable time limits. From the values of the running time and the relative gap (relative difference between upper and lower bounds), we obtained the right classification for this test. These encouraging results led us to design a harder benchmark to better assess the classification capability of our approach. We constructed a large-scale set of 300 protein domains (a subset of the ASTRAL database) that we have called Proteus_300. Using the relative gap of any of the 44850 pairs as a similarity measure, we obtained a classification in very good agreement with SCOP. Our algorithm thus provides a powerful classification tool for large structure databases , .
This year, the links between alignment and automata construction have been clarified. We formalized the partial alignment needed to build automata as an independent constrained matching optimization problem and proposed a new heuristic algorithm to solve it . This new algorithm is faster than the previous heuristic and the semi-exact algorithm proposed in : it can efficiently handle classical real datasets. We proposed a Minimum Description Length measure to choose the best (non-deterministic) automaton with respect to the sequence sample. As a result, we are able to offer biologists Protomata-Learner, a program for the inference of automata on protein sequences, which has been transferred to the GenOuest platform .
This year we also began a new line of research on learning more expressive grammars, such as context-free grammars. Focusing on learning the structure rather than the language, we developed an approach based on recoding repeated elements. To handle large sequences such as genomes, efficient data structures and algorithms have to be used. To detect and score repeats, we used suffix arrays, which need to be updated after each repeat rewriting. We proposed and implemented efficient in-place algorithms to update suffix arrays after each word recoding , .
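The recoding loop can be sketched as follows (the published algorithms update the suffix array in place after each rewriting; this naive version, with an invented input string, simply rebuilds it, which is only viable on small inputs): the most frequent adjacent pair is replaced everywhere by a fresh symbol, and the suffix array of the shortened sequence supports the next round of repeat detection.

```python
# Naive sketch of repeat recoding with a suffix array (illustrative:
# the real work avoids rebuilding the array after each rewriting).

from collections import Counter

def suffix_array(seq):
    """Naive suffix array: start positions sorted by suffix."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def recode_most_frequent_pair(seq, fresh):
    """Replace every non-overlapping occurrence of the most frequent
    adjacent pair by the fresh symbol; return the new sequence."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(fresh)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seq = recode_most_frequent_pair(list("ABABCABAB"), "X")  # AB is most frequent
print("".join(seq))                                      # XXCXX
print(suffix_array("XXCXX"))                             # [2, 4, 1, 3, 0]
```

Iterating the recoding builds a hierarchy of rules, which is the structural skeleton of the learned grammar.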
The purpose of this axis is to contribute to gene and metabolite expression data analysis. The ultimate goal is to build dynamical systems that model interactions implied in biological processes.
Effect of ADAM12 on the TGF signalling pathway.
This topic is the subject of a collaboration with Inserm and a co-supervised PhD thesis (J. Gruel). Hepatic stellate cells (HSC) are a major actor in liver fibrosis, which appears as a deregulation of an inflammatory reaction and frequently evolves into cirrhosis and liver cancer. Under various aggressions from toxins (alcohol) or viruses (hepatitis C virus), HSC undergo a transformation from a quiescent state to an activated one. Their response to TGF is quite different from the response of quiescent cells or epithelial cells. Their proteasome is also different; in particular, a metallo-protease, ADAM12, is expressed specifically in these cells. Our biologist colleagues at Inserm have shown that this protein binds to TGF receptor II and that this binding is correlated with higher levels of TGF signaling .
We have begun to model the traffic of TGF receptors in order to understand the effect of ADAM12. The model was specified in the Garmen language, a modelling tool we developed in previous years. We had to write new code generators for Scilab for the simulations, and for Maple and Sage for the algebraic computations. We were able to formulate different hypotheses on the action of ADAM12 on TGF receptor traffic. Using algebraic forms of the equations, it was possible to compute kinetic constants from the measures in , giving three structural models. Comparison with other experimental measures ruled out two of these models. The remaining one predicts: (1) sustained TGF signaling after TGF is washed out from the HSC; (2) a different ratio of endosome to cell membrane TGF receptors when ADAM12 is expressed. Experiments are now underway in the wet lab, and first results seem to confirm the first prediction.
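A toy version of such a receptor-traffic model illustrates the second prediction qualitatively (all species, reactions and rate constants below are hypothetical illustrations, not the Garmen model): surface receptors are internalized into endosomes, recycled back and degraded, and an ADAM12-like effect is mimicked by lowering the degradation rate, which shifts the endosome-to-membrane receptor ratio.

```python
# Hypothetical two-compartment receptor traffic model, integrated with
# the Euler method (toy rates; not the actual published model).

def simulate(k_int=0.1, k_rec=0.05, k_deg=0.02, t_end=200.0, dt=0.01):
    """Euler integration of
       dRs/dt = -k_int*Rs + k_rec*Re      (surface receptors)
       dRe/dt =  k_int*Rs - (k_rec + k_deg)*Re   (endosomal receptors)
    starting from 100 surface receptors; returns final (Rs, Re)."""
    rs, re_ = 100.0, 0.0
    for _ in range(int(t_end / dt)):
        drs = -k_int * rs + k_rec * re_
        dre = k_int * rs - (k_rec + k_deg) * re_
        rs, re_ = rs + dt * drs, re_ + dt * dre
    return rs, re_

rs0, re0 = simulate()                # baseline degradation rate
rs1, re1 = simulate(k_deg=0.002)     # "ADAM12-like": slower degradation
print(re0 / rs0 < re1 / rs1)         # endosome/membrane ratio increases: True
```

Even this caricature reproduces the qualitative effect: slowing endosomal degradation raises the endosome-to-membrane ratio, the kind of discriminating prediction the algebraic analysis extracts from the full model.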
In parallel, another study is underway to explore the specificity of the response of HSC to five cytokines: IFN (interferon), IGF (insulin-like growth factor), TGF (transforming growth factor), leptin and PDGF (platelet-derived growth factor). A gene expression profile of the early (6h) response to each cytokine was established using micro-array techniques. Moreover, RT-PCR was used to measure the expression of 40 transcription factors in the same conditions. The goal of this study is to explore the particular signalling pathways that are active in HSC. For this purpose, an influence graph was built from information extracted from databases. The statistical treatment of the data is now finished and the data are under investigation. We use qualitative techniques similar to those used in the study of gene networks.
Fatty acid metabolism.
In collaboration with laboratories of INRA and supported by a co-supervised PhD thesis (ASC Inra program, P. Blavy), we investigated the regulation of fatty acid metabolism in hepatic cells. We initiated the construction of models at two levels of complexity: a low-level abstract model and a high-level extended model.
The abstract model was built from the literature. We gathered the set of behaviors that should be satisfied by the model. In this list of behaviors, we included metabolic and genetic data from a diet protocol on mice (provided by our collaborators). We first identified the molecules implied in the target mechanisms, then reduced the list of molecules and interactions in order to obtain a minimal model coping with the initial set of behaviors. First fitting analyses with the data showed that fatty acids have to be distinguished in the model according to both their elongation and their saturation. Our aim is to obtain an explicit differential model and then to compare the predictions of this model to observations in other tissues, mostly adipocyte cells.
Complementary to this approach, we have initiated the construction of a large-scale transcriptional interaction model related to fatty acid metabolism. We aim at comparing this model to DNA-chip data on diet chicken produced by our collaborators, in order to identify missing parts of the model. We have created a curated biological interaction network data source for lipid metabolism. This network contains signed transcriptional information, that is, known (and signed) influences between molecules; concretely, it is modelled as a graph made of HUGO gene IDs and signed directed edges. The information in this network is deduced from: 1. a manual curated annotation of the literature; 2. a curation of information contained in dedicated databases (Ingenuity and Transpath); 3. public databases (Kegg). The model now contains interactions among 400 genes. We expect to gather interactions on up to 600 genes.
On the theoretical side, we have investigated the relation between robustness and hierarchies of models containing at least an abstract and an extended model. The abstract model is the simplest model that still reproduces the behaviour of the biological system. It can be obtained by model reduction techniques. The extended model represents the level of variability, where different perturbations act. The mapping between the extended and the abstract model is a projection from high dimension to low dimension that concentrates (in the Gromov/Talagrand sense) for robust properties. The model reduction method that can be used is modular, being based on the selection of a set of variables to be eliminated. Depending on the type of property that is studied, several choices are possible: rapid variables, Gale-Nikaido univoque response modules, monotone response modules. This approach was studied and detailed in the Habilitation à diriger des recherches of O. Radulescu, defended at the end of 2006 .
Dynamical robustness of biological networks with hierarchical distribution of time scales.
We analysed the concepts of distributed robustness and r-robustness proposed by biologists to explain a variety of stability phenomena in molecular biology. We discussed the robustness of the relaxation time using a chemical reaction description of genetic and signalling networks. The following result holds for linear networks: for large multiscale systems with a hierarchical distribution of time scales, the variance of the inverse relaxation time (as well as the variance of the stationary rate) is much lower than the variance of the separate constants. Moreover, it can tend to 0 faster than 1/n, where n is the number of reactions. We argue that similar phenomena are valid in the nonlinear case as well. As a numerical illustration we used a model of the signalling network of the important transcription factor NFkB , , , .
Module reduction on MAPK network.
We have applied a model reduction method based on Gale-Nikaido univoque response modules to a model of MAPK signal transduction published by Bhalla . This model contains 80 products and 58 interactions and is too complex to be analysed directly. Our approach was to identify modules in this network that have a unique equilibrium state whatever the environment, in order to reduce each such module to a single node. A criterion to obtain such modules is the Gale-Nikaido univoque response theorem, leading to a search for negative subdeterminants in the matrix of the network. However, computing the signs of all subdeterminants leads to an intractable combinatorial problem. We therefore restricted the search to biological submodules, and used a stability rule in order to reduce the time needed to verify the Gale-Nikaido property by 60%. The biological interpretation of this reduction remains to be done .
Diagnosis of large-scale models.
For a couple of years, one of the questions we have been investigating is whether graphs of transcriptional interactions can help biologists analyse DNA-chip data. We have modelled this question as a formal problem of compatibility between an interaction graph (representing the knowledge) and the qualitative variations of a partial subset of nodes of the interaction graph (representing DNA-chip data). We are now able to propose a method to validate and correct a model using qualitative data, based on the following principle: "during a steady-state experiment, every variation on a node should be explained by the variation of at least one predecessor". This rule is transcribed as a set of constraints between the variations of the nodes and the signs of the interactions. Analysing the network then amounts to studying the solutions of the constraints: (i) the internal consistency of the network corresponds to checking that the whole set of constraints has at least one solution; (ii) consistency between a network and datasets corresponds to checking that a partial set of variations on nodes can be extended to a complete solution of the set of constraints; (iii) diagnosing an inconsistent network means that if a system does not satisfy the basic rule, we identify a subset of interactions and data that bear the inconsistencies; (iv) predicting new variations corresponds to identifying the variables that have the same sign in all solutions of the set of constraints.
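The basic rule can be applied directly on small examples (a simplified sketch with invented edges and observations, not the team's implementation): in a signed interaction graph, an observed variation on a node is explained only if some predecessor shows a variation of the right sign along its edge, and the nodes violating the rule point at the interactions or data bearing the inconsistency.

```python
# Sketch of the diagnosis step: flag varied nodes whose observed sign
# is explained by no predecessor (simplified; input nodes are exempt).

def diagnose(edges, variation):
    """edges: (source, target, sign) with sign in {+1, -1};
    variation: observed node variations in {+1, -1}.
    Returns the set of varied nodes left unexplained."""
    preds = {}
    for src, tgt, sign in edges:
        preds.setdefault(tgt, []).append((src, sign))
    bad = set()
    for node, var in variation.items():
        if node not in preds:
            continue                      # input node: no constraint
        explained = any(sign * variation.get(src, 0) == var
                        for src, sign in preds[node])
        if not explained:
            bad.add(node)
    return bad

edges = [('a', 'b', +1), ('c', 'b', -1)]
print(diagnose(edges, {'a': +1, 'c': +1, 'b': +1}))  # set(): a explains b
print(diagnose(edges, {'a': -1, 'c': +1, 'b': +1}))  # {'b'}: unexplained
```

The full method works on the complete constraint system rather than node by node, which is where decision diagrams and constraint solvers come in.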
We have developed two different approaches to study the system of constraints generated by a graph and datasets:
Reducing the system of constraints to an equivalent subsystem, then studying the solutions of the new system of constraints with a ternary decision diagram. This approach was initiated in 2004 and then implemented as a Python plugin in order to manipulate the system of constraints efficiently (see Sec. ). We have shown that this approach is efficient on large-scale regulation systems such as the E. coli transcriptional network , : transcriptional regulations have been proved insufficient to explain steady-state datasets on E. coli, and we identified the part of the system that had to be corrected (by including post-translational regulation) to explain the behavior of the whole system.
Using constraints solvers (collaboration with T.Schaub's lab, Potsdam university), we have studied the applicability of Answer Set Programming (ASP) to solve the qualitative constraints used for checking consistency between a regulatory model and experimental data. The conclusion drawn from this preliminary work is twofold : first ASP solvers can solve particularly large qualitative constraints, which occur when comparing an organism-wide regulatory network with expression data; this was demonstrated in by considering genetic networks of several thousands of genes and regulations. Second, it appears that ASP, taken as a language, provides a simple and reasonably efficient way of developing and testing some extensions of our framework (in particular, stronger notions of consistency between models and data).
This approach is complementary to the previous one since decision diagrams allow to study the full set of solutions to the system of constraints (percentage of solutions satisfying a given condition) but its size of application is limited (at most 100 nodes).
We have illustrated the efficiency of this approach with a contribution to sign inferencein transciptional regulation network. Based on analyses with artificial datasets on E. Colinetwork, we have proved that we can expect to infer about 30% of the signs of the network from about 20 DNA-chip. We have then proposed an algorithm allowing to deal with errors, based on the verification of prediction by several different datasets. With this approach, we are able to infer correctly about 10% of the signs of E. Colinetwork . From a technical point of view, we had to combine decisions diagrams and constraint solvers to acheive these tasks, due to the size of the reduced system of constraints and the type of questions we had about the network. The complementary between theses approaches is the subject the Ph-D thesis of P. Veber, defended by the end of 2007 .
Reasoning on networks with constraint solvers.
Automated reasoning on metabolic pathways is the subject of a collaboration with the University of Potsdam and of a co-supervised PhD thesis (T. Henin). We are also collaborating with UMR 6026 (Molecular and Cellular Interaction) on the modelling of the regulation of bacterial genes involved in oxidative stress.
We use public metabolic pathway databases such as BioModels and Reactome. We are looking for models that allow the explanation of observed behaviours of biological cells, the abduction of new reactions (in order to explain observations that cannot yet be explained) and easy querying by biologists. Answer Set Programming seems a good framework for addressing these various issues.
Blast has steadily become the reference software for exploring genomic banks. This type of algorithm, and many others such as PatternHunter or Chaos, proceeds in two steps: first it searches for anchors, then it extends them into alignments. The load balancing between these two tasks depends on the quality of the anchors.
More generally, the problem of mining genomic banks is bounded either by data access (the time needed to scan all the banks) or by computation time (the time needed to detect good anchors) . We address this problem in two complementary ways: (1) speeding up anchor detection using reconfigurable hardware; (2) speeding up data access using indexing techniques combined with FLASH technology.
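A minimal sketch of this anchor-then-extend strategy (a toy version with exact k-mer anchors and ungapped, unscored extension; real tools such as Blast use scoring matrices and drop-off thresholds):

```python
# Step 1: find anchors -- shared k-mers between query and bank,
# located via a small hash index of the query.
def find_anchors(query, bank, k=4):
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i+k], []).append(i)
    for j in range(len(bank) - k + 1):
        for i in index.get(bank[j:j+k], []):
            yield i, j

# Step 2: extend each anchor into a longer ungapped alignment.
def extend(query, bank, i, j, k=4):
    left = 0
    while i-left-1 >= 0 and j-left-1 >= 0 and query[i-left-1] == bank[j-left-1]:
        left += 1
    right = k
    while i+right < len(query) and j+right < len(bank) and query[i+right] == bank[j+right]:
        right += 1
    return query[i-left:i+right]

q, b = "ACGTACGTGA", "TTACGTACGA"   # toy sequences
hits = {extend(q, b, i, j) for i, j in find_anchors(q, b)}
```

The quality of the anchors governs the balance: longer or more selective seeds yield fewer anchors to extend, at the risk of missing weak alignments.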
Compared to the previous RDISK project , the ReMIX project goes one step further by addressing the data access problem. The idea here is not to duplicate disk accesses, but to propose a hardware mechanism allowing fast random access to gigabytes of data. In this way, indexing techniques accessing only a fraction of the bank become highly efficient.
In the ReMIX architecture, the hard drives of the RDISK project are replaced by FLASH memories whose access times are two to three orders of magnitude shorter. In the same way, data bandwidth is increased by accessing a large number of FLASH memories simultaneously. As in the RDISK project, data are processed on-the-fly by reconfigurable hardware directly connected to the memory. Note that the reconfigurable index memory does not fit in the addressing space of the processor, but is accessed indirectly through specific queries. The whole system holds 512 GBytes of FLASH memory.
Two main genomic applications have been implemented, illustrating the potential of the ReMIX concept. In the first application, the FLASH memory contains the index of large DNA banks, allowing fast retrieval compared to traditional approaches (speed-ups ranging from 20 to 50). The second application is related to the intensive comparison of a large set of proteins against the human genome. The two datasets are fully indexed and compared using the FLASH memory as temporary storage. Again, high performance has been demonstrated: plugging a single ReMIX board into a standard PC decreases the computation time by a factor of about 50 .
In 2007, we addressed the problem of optimizing the indexing of huge banks into the ReMIX memory. This work was performed within an INRIA ARC (called FLASH). Our research was oriented towards subset seeds, based on the work of L. Noé and G. Kucherov (LIFL). A subset seed is composed of non-consecutive flexible characters. This kind of seed leads to better results than standard seeds, but their design is much more difficult. Thus, in order to quickly compute the characteristics of a subset seed or of a set of subset seeds, we built a large bank of exhaustive alignments by intensive computation of Smith-Waterman alignments. This corpus, together with the YASS program developed in the Sequoia team (LIFL), guided us in constructing several efficient sets of subset seeds with a sensitivity slightly better than the one used by Blastp. These seeds were implemented on the ReMIX machine. The tests performed showed a 16x overall speed-up, obtained from a 13x hardware speed-up, while the purely algorithmic approach contributed an additional 24% speed-up .
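As a rough illustration of what "flexible characters" means, a subset seed can be read as a word over a small seed alphabet. The sketch below assumes a three-letter alphabet where '#' requires identity, '@' also accepts a transition (A<->G, C<->T) and '-' accepts anything; this is a simplified reading of subset seeds, not the exact design used on ReMIX:

```python
# Nucleotide transitions (purine<->purine, pyrimidine<->pyrimidine).
TRANSITION = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def seed_hit(seed, u, v):
    """Does the seed detect an anchor between equal-length words u and v?
    '#': identity required; '@': identity or transition; '-': any pair."""
    for s, a, b in zip(seed, u, v):
        if s == "#" and a != b:
            return False
        if s == "@" and a != b and (a, b) not in TRANSITION:
            return False
    return True

hit_flex = seed_hit("##@#", "ACGT", "ACAT")   # G/A is a transition
hit_rigid = seed_hit("####", "ACGT", "ACAT")  # identity required at position 3
```

Such a seed is more sensitive than an exact one of the same length, which is why its design (choosing where to place '#', '@' and '-') is the hard combinatorial part.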
This section details the different hardware platforms we are currently investigating for supporting the parallelization of various bioinformatics computations. We successively present FPGA, GPU, multicore and cluster approaches.
FPGA.
With the increasing amount of available complete genomes, the need for inter- or intra-genome comparison is now a reality, and supercomputer manufacturers now propose to include dedicated accelerator boards in their machines. Through the ANR PARA project, in cooperation with the BULL R&D team (Les Clayes-sous-Bois), we are currently designing a reconfigurable accelerator tightly interconnected to their system.
Even if FPGA components offer substantial processing power, one challenge remains: feeding these resources from the main memory at very high speed. The PCI Express interface, and especially the second PCIe generation, achieves this goal by providing an aggregated bandwidth of 10 GBytes/sec.
In 2007, we developed the concept of a "reconfigurable interface" for automatically generating high-performance PCI Express channels between reconfigurable operators and the host processor. From a high-level description language, based on smart transfer actors, a generator delivers synthesizable VHDL code able to achieve high-speed communication through DMA channels. In the context of the PARA project, a parallel operator made of 512 pico-processors dedicated to protein sequence comparison will be connected to this "reconfigurable interface".
GPUs.
We have adapted the seed heuristic algorithm to one of the latest available graphics boards (NVIDIA GeForce 8800 GTX). This adaptation is not straightforward: it requires exposing substantial low-level parallelism suitable for multithreaded execution. Starting from a matrix implementation on GPU (using the CUDA programming model), the seed-heuristic code has been modified to fit the execution model of the GPU chip (a multiprocessor SIMD architecture). Performance shows a speed-up ranging from 5 to 10. These first experiments are very promising, and a PhD thesis has been started on this topic (G. Rizk).
Multicore chip.
For the last 2-3 years, processor performance growth has been limited by the difficulty of steadily increasing the clock frequency. Manufacturers and worldwide R&D teams already envision multicore processors with a hundred (or more) cores. These new architectures will only be efficient for high-performance computation if the code is targeted at these hardware structures.
With these objectives in mind, we have revisited several bioinformatics codes in order to make them suitable for efficient implementation on these architectures. In October 2007, we started a collaboration with ICT (Institute of Computing Technology), Beijing, China, to implement a parallel code for intensive genomic sequence comparison on the Godson-T multicore chip developed in this institute.
Cluster.
In the European ACGT project, we actively participate in the development of a simulator for modelling cancer tumour development. This activity, referred to as "In Silico Oncology", is a complex and multiscale combination of sciences and technologies aiming to simulate malignant tumour growth and normal tissue response to therapeutic modalities at all levels of biocomplexity. The aim is to better understand cancer and related phenomena and to optimize therapeutic interventions by performing in silico (on the computer) experiments based on the individual data (clinical, imaging, histopathologic, molecular) of the patient.
We are in charge of parallelizing the OncoSimulator code on the ACGT grid, and more specifically on nodes with large computational resources (clusters), either to provide better interactivity (shorter response times) or to provide better model accuracy in a reasonable amount of time.
This research axis aims to design efficient algorithms based on classical optimization techniques (such as Dynamic Programming (DP), Branch & Bound (B&B), Lagrangian Relaxation (LR) and diverse heuristics) for solving a well-known and widely studied optimization problem: the knapsack problem. Our latest paper presents a preprocessing procedure for the 0-1 multidimensional knapsack problem. First, a non-increasing sequence of upper bounds is generated by solving LP-relaxations. Then, a non-decreasing sequence of lower bounds is built using dynamic programming. The comparison of the two sequences allows us either to prove that the best feasible solution obtained is optimal, or to fix a subset of variables to their optimal values. In addition, a heuristic solution is obtained. Computational experiments with a set of large-scale instances show the efficiency of our reduction scheme. In particular, it is shown that our approach reduces the CPU time of the leading commercial software CPLEX from ILOG.
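The bounding principle can be sketched on the ordinary single-constraint 0-1 knapsack (the paper treats the multidimensional case): a greedy LP-relaxation gives an upper bound, dynamic programming a lower bound, and when the two meet the DP solution is proved optimal. The instance below is illustrative only.

```python
def lp_upper_bound(values, weights, capacity):
    """Dantzig bound: fill by value/weight ratio, last item fractionally."""
    items = sorted(range(len(values)), key=lambda i: values[i] / weights[i],
                   reverse=True)
    total, bound = 0, 0.0
    for i in items:
        if total + weights[i] <= capacity:
            total += weights[i]
            bound += values[i]
        else:
            bound += values[i] * (capacity - total) / weights[i]
            break
    return bound

def dp_lower_bound(values, weights, capacity):
    """Classical 0-1 knapsack DP over capacities (exact, hence a lower bound)."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

vals, wts, cap = [60, 100, 120], [10, 20, 30], 60   # toy instance
lb = dp_lower_bound(vals, wts, cap)
ub = lp_upper_bound(vals, wts, cap)
# when lb == ub, the DP solution is proved optimal
```

On this instance both bounds equal 280, closing the gap; the paper's contribution lies in driving such sequences of bounds towards each other to fix variables before handing the reduced problem to a solver.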
The present work is the continuation of part of A. Siegel's research, started before she joined the Symbiose project, and does not concern bioinformatics.
Iterated morphisms of the free monoid are very simple combinatorial objects which produce infinite sequences by iteratively replacing letters with words . In some specific cases (unimodular morphisms of Pisot type), iterated morphisms can be understood in a geometrical framework, thanks to the construction of a fractal . This year was dedicated to the study of the topology of such fractals and the application of topological properties to number systems. By combining the self-similar structure of a Rauzy fractal with its tiling properties, we obtain a suitable partition and description of the boundary of the fractal. We derive characterizations of topological properties of the fractal: box dimension and Hausdorff dimension of the boundary, connectivity, homeomorphism to a disk, uncountable fundamental group . When the substitution is non-unimodular (the dominant eigenvalue of its matrix is an algebraic number but not an integer), the fractal is adjoined with adic components. By describing the boundary of the resulting structure, we obtain a characterization of the real numbers having a purely periodic greedy expansion in a non-integer base, and we give an explicit computation of the largest rational with a purely periodic greedy expansion in a given base (two examples are given) . The examples show that the behaviour of purely periodic expansions in the non-unit case is slightly different from the unit case.
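A classical example of such an iterated morphism is the Tribonacci substitution, a standard unimodular Pisot case (given here only as an illustration of how a morphism generates an infinite sequence, not taken from the results above):

```python
# Tribonacci substitution: 1 -> 12, 2 -> 13, 3 -> 1.
sigma = {"1": "12", "2": "13", "3": "1"}

def iterate(word, n):
    """Apply the morphism n times, replacing each letter by its image word."""
    for _ in range(n):
        word = "".join(sigma[c] for c in word)
    return word

prefix = iterate("1", 4)          # a prefix of the infinite fixed point
lengths = [len(iterate("1", k)) for k in range(7)]
# the lengths satisfy the Tribonacci recurrence T(n) = T(n-1)+T(n-2)+T(n-3)
```

Iterating from the letter 1 yields longer and longer prefixes of a single infinite fixed point; the Rauzy fractal mentioned above is the geometric object attached to this fixed point.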
One main goal in image segmentation consists of splitting the set of image pixels into homogeneous, spatially connected parts. Hierarchical clustering under a spatial contiguity constraint can contribute to a solution of this fundamental problem. Two types of criteria have been tested in the framework of the latter approach, the Ward criterion and the LLA criterion. An accelerated algorithm using the nearest-neighbour aggregation principle has been established . This work is the result of a collaboration with K. Bachar (CREDO Laboratory, ESSC Angers).
Classical data classification is concerned with symmetrical association measures. However, in the field of Data Mining association rules, one fundamental objective consists in building asymmetrical association rule measures. For a rule (implication) A->B, such an implication index (measure) evaluates, in a certain way, the propensity of B knowing A. The LLA approach provides fruitful probabilistic indices for measuring the interest of a rule. However, a local definition depending solely on the rule to be evaluated becomes non-discriminating for large databases. Under these conditions, we have proposed a discriminant extension of the probabilistic indices, defined with respect to a set of potentially interesting rules. This work has been performed in collaboration with J. Azé (LRI laboratory, University of Paris Sud) . These ideas and developments remain to be compared with other probabilistic approaches.
OUEST-genopole, the eighth national genopole, founded in January 2002, acts as a strategic project for higher education and research in life sciences and bioinformatics, and for economic development in the fields of marine sciences, agriculture and food processing, and human health. It is a network, federated through a GIS structure (Scientific Interest Group), of the various academic organisations involved in these fields (more than 50 laboratories from Inra, Inserm, Ifremer, Inria, CNRS, and the Universities of Rennes, Nantes, Brest and Angers) in western France (Région Bretagne and Pays de la Loire). A network of technological platforms is available to all members.
OUEST-genopole is headed by M. Renard (Inra Le Rheu). Jacques Nicolas, in charge of the bioinformatics research field, participates in the monthly meetings of the OUEST-genopole committee.
The Symbiose project is involved in the following national collaboration programs:
National Inra projects Genanimal, Sigenae, BioWorkFlow and BioMAJ, detailed hereafter.
National contracts Proteus, Modulome, PARA, SITCON, VICANNE and DyCoNum. These contracts are detailed hereafter.
ARC Inria (Action de Recherche Coopérative) FLASH and MOCA. These contracts are detailed hereafter.
BioSide.
BioSide is a collaboration with ENSTB-Brest (P. Picouet, S. Bigaret, P. Tanguy) and the Station Biologique Roscoff (FR2424 - X. Bailly, E. Corre, G. Le Corguille). BioSide is an environment where bioinformatics programs are thoroughly described through a set of metadata (including the semantics of their parameters and the execution policy), thus providing access to the original (and not rewritten) programs. The BioSide graphical interface, a Java-based interface, intuitive and didactic, allows the design, execution and storage of workflows (called scenarios in the BioSide terminology). Metadata is a major element of this platform, since it is used both to provide high-level help to the final user (help in parameterizing programs) and to guarantee the dynamic extensibility of BioSide. A standalone version is already available for phylogenetic programs.
BioWorkFlow.
BioWorkFlow is a collaboration with INRA Jouy (C. Caron, V. Martin, J.F. Gibrat), INRA Toulouse (J. Gouzy, S. Carrère), Institut Pasteur (C. Letondal), RPBS (P. Tuffery), ENST-Bretagne (P.Picouet), Station Biologique Roscoff (E.Corre), URGI (F. Samson), Laboratoire Statistique et Génome Evry (M. Hoebeke).
This project, funded by ReNaBi, initiated a reflection concerning the setup of workflows at the national level. First, the different solutions available for bioinformatics will be analysed, and a prototype will be implemented with the best solution.
BioMAJ.
BioMAJ is a joint collaboration between INRA Toulouse (David Allouche), INRA Jouy (Christophe Caron) and IRISA. It aims at the development of a tool dedicated to the management of public biological databanks. Managing several databanks on any site is currently a tedious task involving a lot of time and manpower. BioMAJ copes with this problem and proposes a complete environment dedicated to these tasks. It allows new databases to be quickly set up on any site and managed efficiently (reports, error management, etc.). The resulting software was made available to the community in September 2007.
Sigenae and Genanimal.
The SIGENAE program (Analysis of Breeding Animals' Genomes) is an Inra national program with the ambition of developing generic steps and applied research actions in the domain of animal genomics. It aims at identifying the expressed part of the genome, developing the mapping of entire genomes and studying genetic diversity in animal populations across several species of breeding animals (pig, chicken, trout, cow). It associates public research organizations (Inra, Cirad) and professional structures. At the international level, a privileged partner is the American ARS (Agricultural Research Service), which develops a comparable project. The transcriptomes of three species (trout, chicken and pig) are studied in Rennes.
Symbiose contributes to this program via an Inra engineer, F. Moreews, working on the Sigenae information system. The program is coordinated by Inra Toulouse. Within this framework, we are involved in a collaborative work with UMR Agrocampus-Inra 598: we collaborate on the modelling of fatty acid metabolism (see Sec. ) and on the eQTL contract. QTL (Quantitative Trait Loci) are biomarkers of genomic regions responsible for a substantial part of the variations observed in a given trait. The aim of the eQTL project is to relate QTL regions obtained by linkage analysis to regions, obtained by transcriptomic studies, responsible for the regulation of a set of genes.
PROTEUS
Fold recognition and inverse folding problem: towards large scale protein structure prediction is a three-year ANR grant, Calcul intensif (ANR-06-CIS6-008), managed by the CEA. It started in January 2007 and also involves: i) BIOS (Biocomputing and Structure) at the Ecole Polytechnique; ii) MIG (Mathematics, Informatics and Genome), INRA Jouy-en-Josas; iii) the Physics Laboratory at the Ecole Normale Supérieure de Lyon; iv) the ABI team, University Pierre et Marie Curie, Paris 6.
The main coordinator is T. Simonson from the Ecole Polytechnique. The project focuses on the fold recognition and inverse folding problems. Massive genome sequencing has revealed the sequences of millions of proteins, yet only 30,000 3D protein structures are known. The classic but difficult "fold recognition" problem requires identifying the 3D structure among a library of possible structures. A complementary approach turns the problem around and poses the "inverse folding problem": enumerate all the amino acid sequences compatible with a given 3D structure. In PROTEUS we will address both problems. On the one hand, we will predict the fold of all bacterial proteins of unknown structure (300,000 proteins), using new techniques developed by the participants. On the other hand, we will solve the inverse folding problem for 1300 folds, out of the 2300 known today (SCOP database), using the emerging technique of directed evolution, which mimics the natural evolutionary process. The first year report of the project is available on its site (
http://
The aim of this ANR project is to study and develop optimization methods to better exploit all forms of parallelism offered by modern computers. In this project, the Symbiose team works on the optimization of intensive comparison algorithms and their implementation on a reconfigurable accelerator.
This ANR project, Modulome
A project funded by the French National Research Agency (ANR "Biologie Systémique" program). The Ewing inducible cellular model, developed by one of the biologist partners of the project, is characterized by a malignant genomic translocation and the appearance of a chimeric gene EWS/FLI-1 whose activity leads to uncontrolled cell growth. The goals of the project are to reconstruct the corresponding interaction network, including signal transduction pathways, and, from a detailed model of its functioning, to propose new validation experiments.
The French National Research Agency (ANR "Jeunes chercheurs" program) funded a project named DyCoNum aiming to study, through a transversal approach, digital expansions in several number systems. This project focuses on integer base expansions, non-standard systems with integer base (signed digit expansions), beta-expansions, substitutive numeration systems and (generalized) continued fractions. This program involves W. Steiner and C. Frougny (LIAFA, Paris 7) and B. Adamczewski (Institut Camille Jordan, Lyon 1).
The French Ministry of Research (Ministry Grant (ACI), IMPBio program) funded a project named VicAnne aiming to support French workshops related to the dynamics of biological networks from 2005 to 2007. Jean-Pierre Mazat (Université de Bordeaux II) is the coordinator of this project. Supported workshops include the epigenomics program (genopole Evry), three two-day working sessions on specific themes, and a satellite workshop of the French bioinformatics conference JOBIM.
The development of formal languages aimed at the description of biological systems leads to new tools for reasoning about these systems. Modelling, model inference, parameter estimation, model validation and the checking of properties can be performed in a semi-automatic (computer-assisted) process. However, the growing complexity of models calls for more conceptual tools. Modular decomposition is a promising one: module definition, module composition and module abstraction/refinement are central in this approach. The goal of this ARC (Action de Recherche Coopérative) is to explore new directions in this domain.
This project investigates the optimization of seeds in the context of large genomic database searches using BLAST-like algorithms. Typically, seeds are optimized to reach a better search sensitivity. Here, for a given sensitivity, we want to investigate how seeds can be optimized to reduce the size of the database index. The project includes: (i) a theoretical study, based on previous work by G. Kucherov and L. Noé (LIFL), to design new seeds; (ii) testing these new seeds on a real challenging problem: comparing 700,000 proteins against the human genome (in cooperation with INSERM U694, Angers); (iii) improving the ReMIX programming environment to speed up the implementation of new reconfigurable operators based on these new seeds (in cooperation with the LESTER lab, Lorient).
The project
The principal investigators of this 2007-2008 project are A.N. Gorban (Leicester, UK), O. Radulescu (Univ. Rennes, France), INRA Génétique Animale (ASC project, Rennes and Toulouse) and the Institut Curie, Paris. This project involves several domains: mathematics, molecular biology and bioinformatics. Its purpose is to develop mathematical techniques for the analysis of large-scale non-linear systems of differential equations describing molecular biology networks. We propose to answer the following questions: (1) How can the complexity of such a system be reduced, and what is the minimal complexity of a model that captures the most important behaviours? (2) How can the robustness of a biological system be explained? We propose to study a simplified model of fatty acid metabolism including metabolic, genetic and hormonal regulations.
This program is managed by the French Ministry of Foreign Affairs
The Symbiose project has collaborations with many laboratories, mostly biological, in western France. Collaborations are detailed in the section devoted to new results. Among the most advanced, let us mention:
Agrocampus-Inra Rennes - Laboratoire de Génétique Animale: Analysis of gene regulation involved in the lipid metabolism (P. Blavy, O. Radulescu, A. Siegel).
Inserm U456 (Détoxication et réparation tissulaire). Study of gene regulations in TGF signalling in liver cancer (J. Gruel, M. Le Borgne, O. Radulescu, P. Veber).
École Polytechnique de l'Université de Nantes: integration of CHAVL in an R environment (I.-C. Lerman). Related to the European project ACGT.
École Supérieure des Sciences Commerciales d'Angers (CREDO laboratory): integration of CHAVL in an R environment (I.-C. Lerman).
Laboratoire de Microbiologie UMR1253 STLO, INRA-Agrocampus, 65 rue de Saint-Brieuc, CS84215, 35042 Rennes cedex, France: design of primers (D. Lavenier).
UMR-CNRS 6026 Structure et Dynamique des Macromolécules, Rennes (C. Delamarche): Major Intrinsic Proteins (MIP) (F. Coste, G. Kerbellec, G. Ranchy).
VALORIA, UBS, Vannes: ReMIX project (D. Lavenier, G. Georges, S. Rubini).
LESTER, Lorient, ARC FLASH (D. Lavenier, G. Georges).
UMR-CNRS 6553 - EcoBio : Arabidopsis thaliana transposons (J. Nicolas, S. Tempel), database on abyssal fungi (A. Assi).
MIcroenvironnement et CAncer, UPRES UA 3889 (Thierry Guillaudeux) : Tumor Necrosis Factor (F. Coste, G. Kerbellec)
The Symbiose project has worked with and/or welcomed in Rennes the following french collaborators:
MIG, Inra, Jouy en Josas (J.-F. Gibrat, A. Marin): Protein threading, Proteus (R. Andonov, F. Coste, D. Lavenier)
Institut de Biologie et de Chimie des Protéines (R. Andonov, F. Coste)
CEA, Saclay: Laboratoire d'Intégration des Systèmes et des Technologies, Service Architectures et Conceptions. Reconfigurable computing (D. Lavenier).
LEPG Tours (Y. Bigot) : ANR Modulome (J. Nicolas, C. Rousseau, F. Coste).
URGI Versailles (H. Quesneville): ANR Modulome (J. Nicolas, C. Rousseau).
IML, Marseille and LIRMM, Montpellier: substitutive dynamical systems (A. Siegel).
LIFL, Lille (D. Lavenier, P. Peterlongo).
Curie Institute, biology, medicine and bioinformatics, Paris (O. Radulescu, A. Siegel, M. Le Borgne).
CNRS, biology, Paris: Alain Lilienbaum (O. Radulescu).
U. Paris 7, rheology: Sandra Lerouge (O. Radulescu).
ABI, Paris VI University (Joël Pothier): structural cores (F. Coste).
Institute of Computing Technology, Beijing, China: parallelization of bioinformatics algorithms onto multicore processors (D. Lavenier).
Sofia University: protein structure prediction (R. Andonov).
US, Stony Brook University, Drosophila developmental biology: John Reinitz (O. Radulescu).
Russia, St. Petersburg, mathematics: Sergei Vakulenko, modelling : Vitaly Gursky, bioinformatics: Maria Samsonova. (O. Radulescu).
India, NCBS Bangalore, systems biology: Upi Bhalla, biophysics : Madan Rao. (O. Radulescu).
UK, U.Leicester, applied mathematics: Alexander Gorban. (O. Radulescu).
Germany, Potsdam University, Prof. T. Schaub's lab: logic programming and Boolean constraint solving (P. Veber, J. Nicolas, T. Henin).
Greece, Institute of Communication and Computer Systems, National Technical University of Athens, cooperation: parallelization of cancer tumor growth (oncosimulator) (D. Lavenier, A. Assi, J. Jacques)
Poland, Poznan Supercomputing and Networking Center: Gridification of the oncosimulator on the ACGT GRID (D. Lavenier, A. Assi, J. Jacques)
The following scientists visited the Symbiose project.
Prof. Sergio Camiz (Sapienza University, Rome) visited I.-C. Lerman during the week of 26-30 March 2007. They worked on the empirical behaviour of CHAVLh in the case of heterogeneous data.
Prof. Shigeki Akiyama (Niigata University, Japan) visited A. Siegel for one week in September 2007. They investigated properties of non-integer number systems.
Associate Prof. Joerg Thuswaldner (Leoben University, Austria) visited A. Siegel for one week in March 2007. They investigated topological properties of fractal sets.
Prof. Sergei Vakulenko, Institut Typographique, Saint Petersburg, visited O. Radulescu for 30 days. They worked on the network reduction problem.
Prof. Alexander Gorban, Department of Mathematics, University of Leicester, visited O. Radulescu for 7 days. They worked on the network reduction problem.
Prof. Radu Burlacu, Department of Mathematics, Institute of Agronomy, Bucharest, visited O. Radulescu for 7 days. They investigated modelling questions related to agricultural problems.
The Symbiose project supported the following scientific visits:
Bulgaria: 14 days in Sofia, funded by the Hubert Curien French-Bulgarian partnership "RILA 2006" No 15071XF, Prédiction et comparaison des structures tridimensionnelles des protéines [R. Andonov].
Leicester, UK: two one-week visits, collaboration with A. Gorban, funded by an Egide project [O. Radulescu].
United States, Stony Brook University: collaboration with John Reinitz, April 2007 [O. Radulescu].
Potsdam University, Germany: three-month visit to the group led by Prof. Torsten Schaub. The main research interest of the team is logic programming and Boolean constraint solving. More precisely, the group works on the implementation of Answer Set Programming (ASP) solvers and on the definition of so-called Action Languages (AL). ASP and AL are families of languages for logic programming which are particularly suited to the specification of combinatorial and planning problems. Notably, T. Schaub's group has developed a solver which is among the most efficient of the currently available implementations. [P. Veber]
Potsdam University, Germany: collaboration for 10 days with Torsten Schaub on the ASP constraint solver [T. Henin].
In the context of the Fête de la science, O. Radulescu gave a lecture, Mathématiques des cellules, in Rennes, Oct. 2007.
The fifth meeting of the Bioinformatics platform of OUEST-genopole was held at Irisa, Rennes, on Oct. 23, 2007. It was organized by Sophie Roucan, Olivier Collin and Hugues Leroy. (
http://
A special meeting was organized in tribute to L. Trilling on the occasion of his retirement from teaching duties. It was held on June 22nd, 2007, and J. Nicolas was its main organizer. Invited speakers were J.P. Verjus, J.P. Banâtre, J. Cohen, C. Desmoulins, P. Besnard, R. Allen and J. Nicolas.
In the framework of ACI VicAnne, P. Veber and A. Siegel organized a two-day meeting on biological networks. The subject was Boolean networks and cellular automata in molecular biology. Invited speakers were Philippe Narbel (Labri, Bordeaux), Nazim Fates (LORIA, Nancy), Carsten Peterson (Lund University), Claudine Chaouiya (TAGC, Marseille) and Vincent Fleury (GMCM, Université de Rennes). The first day was an open conference day; the second day was dedicated to discussions and working groups.
The Symbiose project regularly organizes thematic-day conferences on bioinformatics subjects
Transcriptional regulation sequences. This day was organized by F. Coste. Invited speakers were Maximilian Hauessler (MSNC, Gif/Yvette), Jean Philippe Vert (Ecole des Mines de Paris, Centre for Computational Biology) and Jean Imbert (Centre de Recherche en Cancérologie de Marseille).
Boolean networks and cellular automata in molecular biology. See the previous paragraph.
The Symbiose seminar is held on a bi-weekly basis; 15 talks were given in this framework during 2007. Invited speakers may be local as well as national. The audience usually consists of the members of the Symbiose project; however, biologists, computer scientists (Irisa) or mathematicians (Irmar) often attend, depending on the subject of the talk.
Scientific Advisory Board: BIOINFAPA, INRA (Bioinformatics for Animal genomics) and Ouest Genopole (J. Nicolas).
Editorial Board of Mathématiques et Sciences Humaines (Mathematics and Social Sciences) [I.-C. Lerman].
Editorial Board of La Revue de Modulad [B. Tallur].
Member of the Scientific Committee of the French ANR 2007 program Architectures du futur [D. Lavenier].
Scientific Committee of the Prix de thèse Gilles Kahn 2007 [D. Lavenier].
Member of the Scientific & Technical Council of the European Morpheus project [D. Lavenier].
Jury of the PhD Thesis of Julien Ah-Pine, Sur des aspects algébriques et combinatoires de l'analyse relationnelle. Applications en Classification Automatique, en Théorie du choix social et en théorie des tresses, University of Paris 6 [I.-C. Lerman, referee].
Jury of the PhD Thesis of Adrien Goëffon, Nouvelles heuristiques de voisinage et mémétiques pour le problème Maximum de Parcimonie [R. Andonov].
Jury of the PhD Thesis of Laurent Gaubert, Auto-organisation et émergence dans les systèmes couplés, individuation de données issues de systèmes biologiques couplés, Applied Mathematics, ENIB Brest [O. Radulescu].
Jury of the PhD Thesis of Sabri Bayyouh, Apprentissage par proportion analogique, Enssat Lannion [J. Nicolas].
D. Lavenier was a member of the jury of the following PhD Theses:
Jeremy Detrey, Arithmétique Réelle sur FPGA, virgule fixe, virgule flottante et système logarithmique, ENS Lyon [President].
Arthur Segard, Conception d'une architecture pour le décodage LDPC : application au cas des codes DVB-S2, Université de Cergy-Pontoise [President].
Marc Aubry, Annotation fonctionnelle de groupes de gènes issus de l'analyse transcriptomique, Université de Rennes 1.
Alix Poungou, Nanotechnologies et Architectures reconfigurables, Université de Bretagne Occidentale [Referee].
Antony Lebechec, Gestion, Analyse et Intégration des données transcriptomiques, Université de Rennes 1.
Julien Pley, Protocoles d'accord pour la gestion d'une grille de calcul dynamique, Université de Rennes 1.
Emeric Sevin, Annotation des petits éléments dans les génomes procaryotes : nouveaux outils informatiques et application à Sinorhizobium meliloti.
Sandrine Pawlicki, Approche bioinformatique des mécanismes d'agrégation et de polymérisation des protéines amyloïdes, Université de Rennes 1.
Elise Prieur, Méthodes et structures de données pour l'indexation et la détection de répétitions dans les séquences biologiques, Université de Rouen.
Members of the Symbiose project are actively involved in the bioinformatics teaching program proposed by the University of Rennes 1. Furthermore, R. Andonov and D. Lavenier respectively share the responsibility for the 4th- and 5th-year bioinformatics master's degrees with biologist colleagues from the life science department Vie-Agro-Santé. The originality of this two-year training program lies in recruiting both biologists and computer scientists.
Besides the usual teachings of the faculty members, the Symbiose project is involved in the following programs:
Master 1 & 2 BioInformatics. (R. Andonov, D. Lavenier)
Master 2 Computer Science, IFSIC. (A. Assi, F. Coste, H. Leroy)
Master 2 Computer Science, ENST. (H. Leroy)
Master 2 Mathematics. (B. Tallur)
Master 2 Modelling and Intensive Computing, Lebanese University of Beirut (H. Leroy)
Master Biotechnologie et Bio-santé, Université de Brest (O. Collin)
Ecole Supérieure Agronomique de Rennes (F. Coste)
Engineering school S3, ENST-Bretagne Brest (O. Collin)
Bioinformatics, ESEO Engineering school, Angers (D. Lavenier)
INRIA/EDF/CEA School, Oct. 2007, Réduction de Modèles (O. Radulescu)
CAp 2007, Conférence Apprentissage (F. Coste, program committee)
EGC'2007: Knowledge extraction and management (Extraction et Gestion des Connaissances 2007), January 23-26, 2007, Namur, Belgium (I.-C. Lerman, program committee)
ERSA 2007: International Conference on Engineering of Reconfigurable Systems and Algorithms, June 25-28, 2007, Las Vegas, Nevada, USA (D. Lavenier, program committee).
FPL 2007: International Conference on Field Programmable Logic and Applications, Amsterdam, Netherlands, 27-29 August 2007 (D. Lavenier, program committee).
JOBIM 2007, Marseille, July 2007 (A. Siegel, program committee)
MajeSTIC 2007: MAnifestation des JEunes Chercheurs en Sciences et Technologies de l'Information et de la Communication, Caen, France, October 29-31, 2007 (R. Andonov)
Research-In-Team Generalized substitutions, tilings and numeration, Porquerolles, June 2007 (A. Siegel)
ParaFPGA: Parallel Computing with FPGAs, symposium in conjunction with ParCo2007, September 4-7, 2007 (D. Lavenier, program committee)
PBC 2007: Parallel Bio-Computing, Workshop on Parallel Computational Biology, Gdansk, Poland, September 9-12, 2007 (D. Lavenier, program committee)
RC Education: The 2nd International Workshop on Reconfigurable Computing Education, May 12, 2007, Porto Alegre, Brazil (D. Lavenier, program committee)
RIAMS 2007, Réseaux d'interaction : analyse, modélisation et simulation, Lyon, Novembre 2007 (A. Siegel, program committee)
SPL 2007: The III Southern Conference on Programmable Logic, February 26-28, 2007, Mar del Plata, Argentina (D. Lavenier, program committee)
First Southern African Bioinformatics Workshop (F. Coste, Reviewer)
Euromech meeting on "Shear Banding", London, UK, Sept. 2007 (O. Radulescu)
NIH-INRIA Biomedical and Life Sciences Computing Workshop, Bethesda, US, Apr. 2007 (O. Radulescu)
Center for Developmental Genetics, University of Stony Brook, Stony Brook, US, Apr. 2007 (O. Radulescu)
Max Planck Institute Potsdam-Golm, Potsdam, Germany, Jul. 2007, Improving model consistency using qualitative equations (P. Veber)
DFG-Research Center Matheon, Freie Universität Berlin, Germany, Nov. 2007, Improving model consistency using qualitative equations (P. Veber)
Séminaire de Xerox XRCE, Grenoble (F. Coste)
Séminaires Codes & Travaux, IRISA, Apr. 2007, Introduction aux SGBDR et conception d'un système d'information avec MERISE [A. Assi].
Cinquièmes rencontres autour de la plateforme Bio-informatique, Oct. 2007, P2P and Grid combination [A. Assi].
Cinquièmes rencontres autour de la plateforme Bio-informatique, Oct. 2007, Mise en place d'un système de management qualité pour la plate-forme GenOuest [A. Bourdé].
Séminaire des groupes de travail de l'INRIA, Oct. 2007, Exemple concret de certification ISO 9001 sur les plates-formes technologiques de OUEST-genopole : intérêt, difficultés rencontrées, bénéfice retiré et retour d'expérience [A. Bourdé].
Cinquièmes rencontres autour de la plateforme Bio-informatique, Oct. 2007, La plate-forme GenOuest [O. Collin].
Cinquièmes rencontres autour de la plateforme Bio-informatique, Oct. 2007, BioMAJ : pour la mise à disposition des données biologiques [O. Filangi].
ENS Cachan, Antenne de Bretagne, Architectures Matérielles Spécialisées pour la Bioinformatique, Oct. 2007 [D. Lavenier].
Ecole thématique Architectures des systèmes matériels enfouis et méthodes de conception associées, March 2007, Architectures Spécialisées pour la bioinformatique [D. Lavenier].
Journée en l'honneur de L. Trilling, Rennes, June 2007, Expérimentation sur le vivant : de la souris à la puce [J. Nicolas].
Journée Nationale du Groupe de travail SDA2 du GDR Informatique-Mathématique, Oct. 2007, Topologie de fractals et automates [A. Siegel].
Séminaire du Laboratoire Systèmes Navals Complexes, Toulon, Dec. 2007, Analyses de réseaux grande-échelle [A. Siegel].
Séminaire de l'unité INSERM U456, Rennes, Feb. 2007, Utiliser les données issues de la littérature : quand, comment et pour quoi faire ? [P. Veber]
Séminaire de bioinformatique du LINA, Nantes, June 2007, Modélisation qualitative et inférence dans les réseaux de régulations [P. Veber].