The aim of the project MODBIO is to develop computational models for molecular and cell biology. We are focusing on two types of problems:

Determining the structure of biological macromolecules,

Discovering and understanding the function of biological systems.

We approach these questions by combining techniques from constraint programming, combinatorial optimization, hybrid systems, and machine learning.

Sequence and structural alignment, phylogeny.

Protein structure prediction and protein docking.

Modeling alternative splicing.

Participation in the "Génopole Strasbourg Alsace-Lorraine"

Participation in the Bioinformatics project of the Région Lorraine

Participation in the ACI project GENOTO3D

Participation in the "Décrypthon" programme

Various national and international collaborations

Laboratoire "Maturation des ARN et Enzymologie Moléculaire" (MAEM), UMR 7567, Nancy

Laboratoire de Cristallographie, LCM3B, Nancy

Institut de Biologie et Chimie des Protéines, IBCP, Lyon

DFG Research Center Matheon, Berlin, Germany

Institute for Genomics and Bioinformatics, University of California, Irvine, USA

In *constraint programming*, the user programs with constraints, i.e., he or she describes a problem by a set of constraints, which are connected by *combinators* such as conjunction, disjunction, or temporal operators (`always`). Each constraint gives some *partial* information about the state of the system to be studied. Constraint programming systems allow one to deduce new constraints from the given ones and to compute *solutions*, i.e., values for the variables that satisfy all constraints simultaneously.

One of the main goals of constraint programming is to develop programming languages that allow one to express constraint problems in a natural way, and to solve them efficiently.

In our work, we are first interested in constraint problems over finite domains. In this case, the domain of each variable (the set of values it may take) is a finite set of integers. Theory tells us that most constraint problems over finite domains are NP-hard, which means that there is little hope of solving them with algorithms whose running time is polynomial in the size of the input. In practice, these problems are handled by tree search methods, which successively try different valuations of the variables until a solution is found. Because of the exponential number of possible combinations, it is crucial to reduce the search space as much as possible, i.e., to eliminate *a priori* as many valuations as possible.

There exist two generic methods to solve such problems. The first one is classical *integer linear programming* (see also Sect. ), which has been studied in mathematical programming and operations research for more than 40 years. Here, constraints are linear equations and inequalities over the integers. In order to reduce the search space, one typically uses the linear relaxation of the constraint set. Equations and inequalities are first solved over the real numbers, which is much easier; then the information obtained is used to prune the search tree.
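The pruning role of the linear relaxation can be illustrated on a small 0/1 knapsack problem. The sketch below is only an illustration under stated assumptions: the function names `relaxation_bound` and `knapsack` are hypothetical, weights are assumed to be positive, and the problem is a maximization, so the relaxation gives an upper bound. For the knapsack problem, the optimum of the linear relaxation is obtained by a greedy fractional fill, so no LP solver is needed here.

```python
from typing import List, Sequence, Tuple


def relaxation_bound(items: Sequence[Tuple[int, int]], capacity: int,
                     start: int, value: int) -> float:
    """Upper bound from the linear relaxation on items[start:].

    Items must be sorted by decreasing value/weight ratio; the relaxed
    problem is then solved by filling the knapsack greedily and taking
    a fraction of the first item that does not fit.
    """
    bound = float(value)
    for v, w in items[start:]:
        if w <= capacity:
            capacity -= w
            bound += v
        else:
            bound += v * capacity / w  # fractional part of one item
            break
    return bound


def knapsack(values: List[int], weights: List[int], capacity: int) -> int:
    """Branch and bound for the 0/1 knapsack (weights assumed positive)."""
    items = sorted(zip(values, weights),
                   key=lambda vw: vw[0] / vw[1], reverse=True)
    best = 0

    def branch(i: int, cap: int, value: int) -> None:
        nonlocal best
        best = max(best, value)  # every node is a feasible partial selection
        if i == len(items):
            return
        # Prune: the relaxation cannot beat the best solution found so far.
        if relaxation_bound(items, cap, i, value) <= best:
            return
        v, w = items[i]
        if w <= cap:
            branch(i + 1, cap - w, value + v)  # x_i = 1
        branch(i + 1, cap, value)              # x_i = 0

    branch(0, capacity, 0)
    return best
```

For instance, `knapsack([60, 100, 120], [10, 20, 30], 50)` explores only a few nodes before the bound cuts off the remaining branches.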

The second method is *finite domain constraint programming*, which arose in the last 15 years by combining ideas from declarative programming languages and constraint satisfaction techniques in artificial intelligence. In contrast to integer linear optimization, one uses, in addition to simple arithmetic constraints, more complex constraints, which are called *symbolic constraints*. For instance, the symbolic constraint $\mathrm{alldifferent}(x_1, \ldots, x_n)$ expresses that the values of the variables $x_1, \ldots, x_n$ must be pairwise distinct. Such a constraint is difficult to express in a compact way using only linear equations and inequalities. Symbolic constraints are handled individually by specific filtering algorithms that reduce the domains of the variables. This information is propagated to other constraints, which may further reduce the domains.
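The interplay between filtering and tree search can be sketched as follows. This is a minimal illustration, not a description of any particular solver: the names `propagate_alldifferent` and `solve` are hypothetical, and the filtering shown is deliberately weak (it only removes the value of a fixed variable from the other domains), whereas production systems use stronger algorithms for this constraint.

```python
from typing import Dict, Optional, Set

Domains = Dict[str, Set[int]]


def propagate_alldifferent(domains: Domains) -> Optional[Domains]:
    """Filter domains under alldifferent; return None if a domain becomes empty."""
    domains = {var: set(dom) for var, dom in domains.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for var, dom in domains.items():
            if len(dom) != 1:
                continue
            (value,) = dom
            for other, other_dom in domains.items():
                if other != var and value in other_dom:
                    other_dom.discard(value)
                    if not other_dom:
                        return None  # inconsistency detected without search
                    changed = True
    return domains


def solve(domains: Domains) -> Optional[Dict[str, int]]:
    """Tree search: propagate, then branch on the variable with the smallest domain."""
    filtered = propagate_alldifferent(domains)
    if filtered is None:
        return None
    if all(len(dom) == 1 for dom in filtered.values()):
        return {var: next(iter(dom)) for var, dom in filtered.items()}
    var = min((v for v in filtered if len(filtered[v]) > 1),
              key=lambda v: len(filtered[v]))
    for value in sorted(filtered[var]):
        trial = dict(filtered)
        trial[var] = {value}
        result = solve(trial)
        if result is not None:
            return result
    return None
```

With the domains `{"x": {1, 2}, "y": {1, 2}, "z": {1, 2, 3}}`, fixing `x` triggers a chain of domain reductions that determines `y` and `z` without further branching.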

A state-of-the-art survey of finite domain constraint programming, with special emphasis on its relation to integer linear programming, can be found in .

In *concurrent* constraint programming (cc), different computation processes may run concurrently. Interaction is possible via the *constraint store*. The store contains all the constraints currently known about the system. A process may *tell* the store a new constraint, or *ask* the store whether some constraint is entailed by the information currently available, in which case further action is taken.
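The tell/ask interaction can be sketched with a toy store. The class below is a simplified assumption made for illustration: the name `ConstraintStore` is hypothetical, constraints are restricted to interval bounds on named variables, and `ask` simply returns a Boolean, whereas in a cc language an asking process would typically suspend until the constraint is entailed.

```python
from math import inf
from typing import Dict, Tuple


class ConstraintStore:
    """Toy store holding one interval [lo, hi] per variable."""

    def __init__(self) -> None:
        self.bounds: Dict[str, Tuple[float, float]] = {}

    def tell(self, var: str, lo: float = -inf, hi: float = inf) -> None:
        """Add the constraint lo <= var <= hi; the store only ever gets stronger."""
        cur_lo, cur_hi = self.bounds.get(var, (-inf, inf))
        new_lo, new_hi = max(cur_lo, lo), min(cur_hi, hi)
        if new_lo > new_hi:
            raise ValueError(f"inconsistent store for {var}")
        self.bounds[var] = (new_lo, new_hi)

    def ask(self, var: str, lo: float = -inf, hi: float = inf) -> bool:
        """Return True if lo <= var <= hi is entailed by the store."""
        cur_lo, cur_hi = self.bounds.get(var, (-inf, inf))
        return lo <= cur_lo and cur_hi <= hi
```

After `store.tell("x", lo=0)`, the query `store.ask("x", hi=10)` is not yet entailed; it becomes entailed once another process tells, say, `x <= 5`.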

*Hybrid* concurrent constraint programming (`Hybrid cc`) is an extension of concurrent constraint programming which allows one to model and to simulate the temporal evolution of *hybrid systems*, i.e., systems that exhibit both discrete and continuous state changes. Constraints in `Hybrid cc` may be both algebraic and differential equations. State changes can be specified using the combinators of concurrent constraint programming and default logic. `Hybrid cc` is well-suited to model dynamic biological systems, as shown in .
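To make the notion of a hybrid system concrete, the following sketch simulates a thermostat, a standard textbook example that combines a continuous evolution (a differential equation for the temperature) with discrete state changes (the heater switching on and off). It is written in plain Python with explicit Euler integration and is not `Hybrid cc` code; the function name `simulate_thermostat` and all parameter values are assumptions chosen for illustration.

```python
from typing import List, Tuple


def simulate_thermostat(t_end: float = 100.0, dt: float = 0.01,
                        temp: float = 20.0, low: float = 18.0,
                        high: float = 22.0, ambient: float = 10.0,
                        loss: float = 0.1, power: float = 3.0
                        ) -> Tuple[List[float], int]:
    """Simulate a heater with hysteresis; return the trace and the switch count."""
    heater_on = False
    switches = 0
    trace = [temp]
    for _ in range(int(round(t_end / dt))):
        # Discrete state change: guards on the continuous variable.
        if not heater_on and temp < low:
            heater_on = True
            switches += 1
        elif heater_on and temp > high:
            heater_on = False
            switches += 1
        # Continuous evolution: dT/dt = -loss * (T - ambient) + heating.
        derivative = -loss * (temp - ambient) + (power if heater_on else 0.0)
        temp += dt * derivative
        trace.append(temp)
    return trace, switches
```

With the default parameters the temperature oscillates between the two thresholds, each crossing causing one discrete mode change.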

Statistical learning theory is a field of inferential statistics whose foundations were laid by V.N. Vapnik in the late 1960s. The goal of this theory is to specify the conditions under which it is possible to "learn" from empirical data obtained by random sampling. Learning amounts to solving a problem of function or model selection. Basically, given a task characterized by a joint probability distribution on pairs made up of observations and labels, and a class of functions, usually of infinite cardinality, the goal is to find in the class a function with optimal performance. Training can thus be reformulated as an optimization problem. In many cases, the objective function is related to the capacity of the class of functions . The learning tasks considered belong to one of the following three areas: pattern recognition (discriminant analysis), function approximation (regression) and density estimation.

This theory considers more specifically two inductive principles. The first one, the empirical risk minimization (ERM) principle, consists in minimizing the training error. If the sample is small, one replaces it with the structural risk minimization (SRM) principle, which consists in minimizing an upper bound on the expected risk (generalization error), a bound sometimes called a guaranteed risk. This latter principle is implemented in the training algorithms of support vector machines (SVMs), which currently constitute the state of the art for numerous pattern recognition problems.
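The two principles can be written down as follows. The notation is an assumption introduced here for illustration: $(x_i, y_i)$, $i = 1, \ldots, m$, is the training sample, $\ell$ a loss function, $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ a nested sequence of function classes of increasing capacity, and $\Phi$ a generic confidence term that grows with the capacity of the class and shrinks with the sample size; its exact form depends on the bound used.

```latex
% Expected risk and empirical risk of a function f
R(f) = \mathbb{E}_{(X,Y)}\bigl[\ell(f(X), Y)\bigr],
\qquad
R_{\mathrm{emp}}(f) = \frac{1}{m} \sum_{i=1}^{m} \ell\bigl(f(x_i), y_i\bigr)

% ERM: minimize the training error over a fixed class
f_{\mathrm{ERM}} \in \arg\min_{f \in \mathcal{F}} R_{\mathrm{emp}}(f)

% SRM: minimize a guaranteed risk over the nested classes
f_{\mathrm{SRM}} \in \arg\min_{j,\; f \in \mathcal{F}_j}
  \Bigl[ R_{\mathrm{emp}}(f) + \Phi(\mathcal{F}_j, m, \delta) \Bigr],
\qquad
R(f) \le R_{\mathrm{emp}}(f) + \Phi(\mathcal{F}_j, m, \delta)
\ \text{with probability at least } 1 - \delta
```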

SVMs are connectionist models conceived to compute indicator functions, to perform regression or to estimate densities. They were introduced during the last decade by Vapnik and co-workers , as nonlinear extensions of the maximal margin hyperplane . Their main advantage is that they can avoid overfitting when the sample is small , .

"Combinatorial optimization is a lively field of applied mathematics, combining techniques from combinatorics, linear programming, and the theory of algorithms, to solve optimization problems over discrete structures" . A combinatorial optimization problem can be defined as follows: we are given a ground set $N$ and consider a finite collection of subsets, say $\{S_k\}$. For each subset $S_k$ there is an objective function value, $f(S_k)$, typically a linear function over the elements in $S_k$. The task is to find the subset $S_k$ that minimizes $f(S_k)$. Typically, the feasible subsets are represented by inclusion or exclusion of members such that they satisfy certain conditions. Well-known examples of combinatorial optimization problems are assignment, covering, cutting stock, knapsack, matching, packing, partitioning, routing, sequencing, scheduling (jobs), shortest path, spanning tree, and traveling salesman problems.

This then becomes a special class of integer programs (IP) whose decision variables are binary valued: $x_i = 1$ if the $i$-th element is in the optimal solution; otherwise, $x_i = 0$. In this case, feasible subsets have to be expressed by linear constraints. IP formulations are not always easy, and often there is more than one formulation, some better than others. Many good formulations have exponential size.
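As an illustration, a generic binary integer program and one classical instance, set covering, can be written as below. The symbols $c_i$ (cost of element $i$), $A$, $b$, and the covering data $T_1, \ldots, T_n \subseteq U$ are assumptions introduced for this sketch; they do not appear in the text above.

```latex
% Generic 0/1 program with a linear objective f(S) = \sum_{i \in S} c_i
\min_{x \in \{0,1\}^{n}} \; \sum_{i=1}^{n} c_i \, x_i
\quad \text{subject to} \quad A x \ge b

% Set covering: choose a cheapest family of subsets T_i covering every u in U
\min_{x \in \{0,1\}^{n}} \; \sum_{i=1}^{n} c_i \, x_i
\quad \text{subject to} \quad
\sum_{i \,:\, u \in T_i} x_i \ge 1 \quad \text{for all } u \in U
```

Here the feasible subsets are exactly those index sets $S = \{ i : x_i = 1 \}$ whose chosen $T_i$ cover $U$, which shows how a combinatorial condition is turned into linear constraints.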

Molecular biology is concerned with the study of three types of biological macromolecules: DNA, RNA, and proteins. Each of these molecules can initially be viewed as a string on a finite alphabet: DNA and RNA are nucleic acids made up of nucleotides A,C,G,T and A,C,G,U, respectively. Proteins are sequences of amino acids, which may be represented by an alphabet of 20 letters.

Molecular biology studies the information flow from DNA to RNA, and from RNA to proteins. In a first step, called *transcription*, a DNA string ("gene") is transcribed into messenger RNA (mRNA). In the second step, called *translation*, the mRNA is translated into a protein: each triplet of nucleotides encodes one amino acid according to the genetic code. The genes of eukaryotic cells are mostly composed of a succession of coding regions, called exons, and non-coding regions, called introns. During transcription, an intermediate step, the *splicing* process, is then necessary to remove the introns from the premessenger RNA. The remaining exons are concatenated, yielding the mature RNA molecule. *Alternative splicing* is a regulatory mechanism by which variations in the incorporation of the exons into mRNA lead to the production of different forms of mature mRNA and consequently to more than one related protein, or isoform.
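The string view of transcription, splicing and translation can be sketched in a few lines. This is a simplified illustration: the function names are hypothetical, transcription is modeled by rewriting the coding strand (T becomes U), exon coordinates are assumed to be given, and `CODON_TABLE` contains only a small fragment of the standard genetic code, enough for the toy gene used below.

```python
from typing import List, Tuple

# Fragment of the standard genetic code ("*" marks a stop codon).
CODON_TABLE = {
    "AUG": "M", "GCU": "A", "UUU": "F", "AAA": "K", "UGG": "W",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to premessenger RNA: replace T by U."""
    return dna.upper().replace("T", "U")


def splice(pre_mrna: str, exons: List[Tuple[int, int]]) -> str:
    """Concatenate the exons, given as half-open (start, end) intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon or the end of the string."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]  # KeyError if codon not in fragment
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy gene: three exons separated by two introns (made-up sequence).
gene = "ATGGCT" + "GTAAGT" + "TTTAAA" + "GTCCAG" + "TGGTAA"
pre_mrna = transcribe(gene)
full_isoform = translate(splice(pre_mrna, [(0, 6), (12, 18), (24, 30)]))
skipped_isoform = translate(splice(pre_mrna, [(0, 6), (24, 30)]))
```

The two calls to `splice` use the same premessenger RNA but different exon selections, which is how alternative splicing yields two different protein isoforms from one gene.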

Biological macromolecules are not just sequences of nucleotides or amino acids. Actually, they are complex three-dimensional objects. DNA shows the famous double-helix structure. RNA and proteins fold into complex three-dimensional structures, which depend on the underlying sequence. RNA is a single-stranded chain of nucleotides. However, a nucleotide in one part of the molecule can base-pair with a nucleotide in another part, following the Watson-Crick complementarity rules. This results in a folding of the molecule. The *secondary structure* of RNA indicates the set of base pairings in the three-dimensional structure of the molecule. This information can be represented by a graph.
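One common way to write down such a set of base pairings is the dot-bracket notation, where matching parentheses mark paired positions and dots mark unpaired ones. The notation is not used in the text above and is adopted here as an assumption; the function names are hypothetical. The sketch extracts the base-pair edges of the graph and checks them against the Watson-Crick rules (A with U, G with C).

```python
from typing import List, Tuple

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return the paired positions (i, j), i < j, of a dot-bracket string."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def is_watson_crick(sequence: str, structure: str) -> bool:
    """Check that every pairing in the structure is A-U or G-C."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return all((sequence[i], sequence[j]) in WATSON_CRICK
               for i, j in base_pairs(structure))
```

For the hairpin `"(((...)))"` on the sequence `"GGCAAAGCC"`, the graph has the backbone edges between consecutive nucleotides plus the three pairing edges returned by `base_pairs`.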

Proteins have several levels of structure. Above the primary structure (i.e. the sequence) is the *secondary structure*, which involves three basic types: *α-helices*, *β-sheets*, and aperiodic structure elements called *loops*. The spatial relationship of the secondary structures forms the tertiary structure. Several proteins can function together in a protein complex whose structure is referred to as the quaternary structure. A *domain* of a protein is a combination of secondary structure elements with some specific function. It contains an *active site* where an interaction with an external molecule may happen. A protein may have one or several domains.

The ultimate goal of molecular biology is to understand the *function* of biological macromolecules in the life of the cell. Function results from the *interaction* between different macromolecules, and depends on their structure. The overall challenge is to make the leap from sequence to function, through structure: the prediction of structure will help to predict the function.

Thanks to the huge number of gene and protein sequences available in the sequence databases, molecular phylogenetic analyses have multiplied over the past few decades. Molecular phylogeny is the use of gene or protein sequences to gain information on the evolutionary history of organisms. By comparing the sequence of a gene in different organisms, the evolutionary history of these sequences can be inferred. Based on the hypothesis that these sequences are orthologs (i.e. derive from the same ancestral sequence by speciation events), the evolutionary history of the organisms can also be inferred and represented by a tree.

We have extended the functionalities and optimized the code of the application devoted to the standard M-SVM (M-SVM1 in ), and its variant dedicated to protein sequence processing .

We have developed a first version of a new multi-class discriminant model based on modular task decomposition and bi-class SVMs (DSVM in ). This version is available at the following address: http://www710.univ-lyon1.fr/~kbenabde/index_fichiers/Page902.htm.

We have continued our study of the generalization error of large margin multi-class discriminant models, such as MLPs or M-SVMs, laying emphasis on the use of bounds for model selection. The introduction of new generalizations of the notion of Vapnik-Chervonenkis (VC) dimension, called margin Ψ-dimensions, has enabled us to complete the VC theory of large margin multi-class discriminant models . In parallel, the work on the computation of upper bounds on the empirical risk of M-SVMs based on the leave-one-out procedure has led to new results presented in . All the aforementioned bounds are being progressively incorporated into our M-SVM software, where they can be used to select the "soft margin" (regularization) parameter $C$.

We have also written a survey on multi-class SVMs .

Knowing the three-dimensional structure of a protein can greatly help to infer its function. Predicting this *tertiary structure* from the sequence of amino acids (or *primary structure*) remains one of the central open problems in structural biology. This is the subject of the "GENOTO3D" project that we coordinate. This year, our main efforts have been concentrated on identifying the fold class of protein sequences of unknown structure. To that end, in collaboration with Gilbert Deléage and Christophe Geourjon, at IBCP, in Lyon, Khalid Benabdeslem has developed an original approach for processing 3D structures of proteins. The approach consists in generating a significant data set for machine learning and building a modeling system for fold recognition. In the first step, a structural alignment is computed for each class (family) of structures using the Combinatorial Extension (CE) alignment method. In the second step, a taxonomy is derived from each alignment matrix by agglomerative hierarchical clustering (AHC). Then, structural cores are extracted from all families, each core serving as a prototype for its class of structures . Finally, a neural network for protein fold recognition is built on a matrix-based coding of the sequences , . Subsequently, with the help of core extraction, a data set is generated to build a strong fold recognition system with accuracy exceeding 72% over 100 CATH families .

Our collaboration with Nicolas Sapay and Gilbert Deléage on the prediction of amphipathic in-plane membrane (IPM) anchors in monotopic proteins has led to a new prediction method, "AmphipaSeek" , which is available from the website of the PBIL, at the following address: http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_amphipaseek.html.

The first step in inferring the evolutionary history of gene or protein sequences consists in building an alignment of all these sequences, i.e. in determining the homology (common ancestry) at each site of the sequences. To that end, biologists generally use the algorithm provided by the ClustalW program. This algorithm is based on the computation of a distance between each pair of sequences, which makes use of statistical models of DNA/protein evolution. We have started investigating the benefit of replacing this distance with a kernel.
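The idea of deriving a distance from a kernel can be illustrated with a simple string kernel. The sketch below uses a k-mer spectrum kernel, chosen here only as a familiar example: it is an assumption, not the kernel studied in this work, and it is unrelated to the distance computed by ClustalW. Any kernel $k$ induces the distance $d(a, b) = \sqrt{k(a, a) - 2\,k(a, b) + k(b, b)}$ in its feature space.

```python
from collections import Counter
from math import sqrt


def spectrum(sequence: str, k: int = 3) -> Counter:
    """Count every substring of length k (k-mer) in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of a and b."""
    counts_a, counts_b = spectrum(a, k), spectrum(b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


def kernel_distance(a: str, b: str, k: int = 3) -> float:
    """Distance induced by the kernel in its feature space."""
    squared = (spectrum_kernel(a, a, k)
               - 2 * spectrum_kernel(a, b, k)
               + spectrum_kernel(b, b, k))
    return sqrt(max(squared, 0))  # guard against tiny negative values
```

Such a function can be applied to every pair of sequences to fill the pairwise matrix that a distance-based alignment or tree-building procedure takes as input.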

We participate in the "Génopole Strasbourg Alsace-Lorraine" together with the laboratory MAEM and the IGBMC in Strasbourg.

In the framework of the CPER Lorraine 2000-2006, we participate in the project "Bioinformatics and Applications to Genomics" of the PRST "Intelligence Logicielle". Our partners here are the Laboratory of Crystallography LCM3B (UMR 7036), the "équipe de Dynamique des Assemblages Membranaires" (eDAM, UMR 7565) and the MAEM (UMR 7567) at the University Henri Poincaré, Nancy 1.

Since September 2003, we have been coordinating a project called GENOTO3D, which is funded by the "Action Concertée Incitative" (ACI) "Masses de Données". The aim of this project is to apply machine learning approaches to the prediction of the tertiary structure of globular proteins. Our partners are the IBCP in Lyon, the LIF in Marseille, the project team SYMBIOSE from IRISA, the LIRMM in Montpellier, and the MIG unit of INRA in Jouy-en-Josas.

Since October 2005, we have been participating in the project "Développement et utilisation d'approches informatiques et théoriques pour l'analyse des liens existant entre défauts d'épissage et maladies génétiques" (development and use of computational and theoretical approaches for analyzing the links between splicing defects and genetic diseases), funded by the Décrypthon programme: http://www.decrypthon.fr/. Our partner in this project is the MAEM laboratory.

Yann Guermeur has been a member of the program committee of CAp'06. He is an expert for the ANR.

Fabienne Thomarat is Associate Professor at the Ecole Nationale Supérieure des Mines de Nancy / Institut National Polytechnique de Lorraine (engineering school, master of engineering school). She is in charge of one option (Bioinformatics) at the Department of Computer Science.

Yann Guermeur has been teaching bioinformatics in the M2P speciality "Génomique et Informatique" of the Master "Sciences de la Vie et de la Santé" (SVS), at the UHP.