**Computational Biology and Computational Structural
Biology.** Understanding the lineage between species and the
genetic drift of genes and genomes, apprehending the control
and feed-back loops governing the behavior of a cell, a
tissue, an organ or a body, and inferring the relationship
between the structure of biological (macro)-molecules and
their functions are amongst the major challenges of modern
biology. The investigation of these challenges is supported
by three types of data: genomic data, transcription and
expression data, and structural data.

Genomic data feature sequences of nucleotides on DNA and
RNA molecules, and are symbolic data whose processing falls
in the realm of Theoretical Computer Science: dynamic
programming, algorithms on texts and strings, graph theory
dedicated to phylogenetic problems. Transcription and
expression data feature evolving concentrations of molecules
(RNAs, proteins, metabolites) over time, and fit in the
formalism of discrete and continuous dynamical systems, and
of graph theory. The exploration and the modeling of these
data are covered by a rapidly expanding research field termed
*systems biology*. Structural data encode information
about the 3D structures of molecules (nucleic
acids, proteins, small molecules) and their interactions, and
come from three main sources: X-ray crystallography, NMR
spectroscopy, and cryo-electron microscopy. Ultimately,
structural data should expand our understanding of how the
structure accounts for the function of macro-molecules, one
of the central questions in structural biology. This goal
actually subsumes two equally difficult challenges, which are
*folding*, the process through which a protein adopts its
3D structure, and
*docking*, the process through which two or several
molecules assemble. Folding and docking are driven by
non-covalent interactions, and for complex systems, are actually
intertwined. Apart from the bio-physical
interests raised by these processes, two different
application domains are concerned: in fundamental biology,
one is primarily interested in understanding the machinery of
the cell; in medicine, applications to drug design are
developed.

**Modeling in Computational Structural Biology.** Acquiring
structural data is not always possible: NMR is restricted to
relatively small molecules; membrane proteins are difficult to
crystallize, etc. As a matter of fact, while the order of
magnitude of the number of genomes sequenced is one thousand,
the Protein Data Bank contains (a mere) 45,000 structures.
(Because one gene may yield a number of proteins through
splicing, it is difficult to estimate the number of proteins
from the number of genes. However, the latter is several
orders of magnitude beyond the former.) For these reasons,
*molecular modeling* is expected to play a key role in
investigating structural issues.

Ideally, bio-physical models of macro-molecules should resort to quantum mechanics. While this is possible for small systems, say up to 50 atoms, large systems are investigated within the framework of the Born-Oppenheimer approximation, which stipulates that the nuclei and the electron cloud can be decoupled. Example force fields developed in this realm are AMBER, CHARMM, and OPLS. Of particular importance are Van der Waals models, where each atom is modeled by a sphere whose radius depends on the atom's chemical type. From a historical perspective, Richards, and later Connolly, while defining molecular surfaces and developing algorithms to compute them, established the connections between molecular modeling and geometric constructions. Remarkably, a number of difficult problems (e.g. additively weighted Voronoi diagrams) were touched upon in these early days.

The models developed in this vein are instrumental in
investigating the interactions of molecules for which no
structural data is available. But such models often fall
short of providing complete answers, which we illustrate
with the folding problem. On one hand, as the conformations
of side-chains belong to discrete sets (the so-called
rotamers or rotational isomers), the number of distinct
conformations of a poly-peptidic chain is exponential in the
number of amino-acids. On the other hand, Nature folds
proteins within time scales ranging from milliseconds to
hours, which is out of reach for simulations. The fact that
Nature avoids the exponential trap is known as Levinthal's
paradox. The intrinsic difficulty of these problems calls for
models exploiting several classes of information. For small
systems,
*ab initio* models can be built from first principles.
But for more complex systems,
*homology* or template-based models integrating a
variable amount of knowledge acquired on similar systems are
resorted to.
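The exponential growth behind Levinthal's paradox can be made concrete with a few lines of arithmetic. The sketch below is only an illustration: the chain length, the number of states per residue and the sampling rate are assumed values chosen for the example, not figures taken from this report.

```python
import math

def conformation_count(n_residues: int, states_per_residue: int) -> int:
    """Number of conformations if each residue independently picks one state."""
    return states_per_residue ** n_residues

def exhaustive_search_years(n_residues: int, states_per_residue: int,
                            rate_per_second: float) -> float:
    """Years needed to enumerate every conformation at a fixed sampling rate."""
    seconds = conformation_count(n_residues, states_per_residue) / rate_per_second
    return seconds / (365.25 * 24 * 3600)

if __name__ == "__main__":
    # Assumed, illustrative values: 100 residues, 3 states each,
    # 1e13 conformations visited per second.
    n, k, rate = 100, 3, 1e13
    print(f"log10(number of conformations) = {n * math.log10(k):.1f}")
    print(f"exhaustive enumeration: {exhaustive_search_years(n, k, rate):.2e} years")
```

Even under these generous assumptions the enumeration time exceeds any physical time scale by many orders of magnitude, which is why exhaustive search is not a viable model of folding.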

The variety of approaches developed is illustrated by the
two community-wide experiments CASP (
*Critical Assessment of Techniques for Protein Structure
Prediction*;
http://) and CAPRI (
*Critical Assessment of Prediction of Interactions*;
http://).

As illustrated by the previous discussion, modeling macro-molecules touches upon biology, physics and chemistry, as well as mathematics and computer science. In the following, we present the topics investigated within ABS.

We achieved significant results on the problem of modeling macro-molecular complexes, at different scales.

Concerning the atomic-level modeling
of binary complexes, we officially released
`Intervor`, a program implementing
structural descriptors improving the description of
protein - protein interfaces. Thanks to the aforementioned
Bioinformatics paper and the web-site
http://

In fact, we expect
`Intervor` to become one of the standard tools to model
protein interfaces, and we also look forward to specific
developments for the important class of antibody-antigen
complexes.

Concerning the intermediate resolution
modeling of large protein assemblies, we made progress
on the problem of making a
precise assessment of
*fuzzy / uncertain* models reconstructed from data
integration. In a nutshell, the
reconstruction of assemblies involving hundreds of
polypeptide chains is inherently challenging due to
uncertainties on the various data involved in such a task. We
proposed a framework allowing one to model these
uncertainties explicitly. The framework consists of replacing a possibly
erroneous fixed shape by a one-parameter family of shapes.
This family consists of a finite collection of so-called
*toleranced balls*, a toleranced ball being a ball whose
radius can be interpolated within a prescribed interval. In
particular, mining
*stable* features of geometric domains in this
one-parameter family hints at important bio-physical
structures. This work puts us in position to make a precise
assessment of putative models of the Nuclear Pore Complex
(NPC), the largest protein assembly
known to date in eukaryotic cells.

From a geometric and topological standpoint, the methodology relies on the $\alpha$-shapes associated to a compoundly-weighted Voronoi diagram, whose bisectors are degree-four algebraic surfaces.
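To make the growth model concrete, here is a minimal sketch of a toleranced ball and of the parameter value at which it reaches a given point. The class name `TolerancedBall`, the parameter name `lam` and the linear interpolation convention (inner radius at `lam = 0`, outer radius at `lam = 1`) are assumptions made for the illustration; this is not the implementation used in the work described above.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TolerancedBall:
    center: tuple      # (x, y, z)
    r_inner: float     # radius at lam = 0
    r_outer: float     # radius at lam = 1, assumed strictly larger than r_inner

    def radius(self, lam: float) -> float:
        """Linearly interpolated radius; lam outside [0, 1] extrapolates."""
        return self.r_inner + lam * (self.r_outer - self.r_inner)

    def reach(self, point) -> float:
        """Value of lam at which the growing ball first touches `point`."""
        d = math.dist(self.center, point)
        return (d - self.r_inner) / (self.r_outer - self.r_inner)

def first_ball_to_reach(balls, point) -> int:
    """Index of the ball whose growth reaches `point` first."""
    return min(range(len(balls)), key=lambda i: balls[i].reach(point))
```

Because `reach` subtracts the inner radius and divides by the growth rate, it combines an additive and a multiplicative weight. A point can therefore be reached first by a ball that is farther away in the Euclidean sense but grows faster.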

The research conducted by ABS focuses on two main directions in Computational Structural Biology (CSB), each such direction calling for specific algorithmic developments. These directions are:

- Modeling interfaces and contacts,

- Modeling the flexibility of macro-molecules.

**Problems addressed.** The Protein Data Bank,
http://
*interacting* with atoms of the second one. Understanding
the structure of interfaces is central to understanding
biological complexes and thus the function of biological
molecules. Yet, in spite of almost three
decades of investigations, the basic principles guiding the
formation of interfaces and accounting for their stability are
unknown. Current investigations follow
two routes. From the experimental perspective, directed mutagenesis enables
one to quantify the energetic importance of residues,
important residues being termed
*hot* residues. Such studies recently evidenced the
*modular* architecture of interfaces. From the modeling perspective,
the main issue consists of guessing the hot residues from
sequence and/or structural information.

The description of interfaces is also of special interest
to improve
*scoring functions*. By scoring function, two things are
meant. The first is a function which assigns to a complex a
quantity homogeneous to a free energy change, where
$G = H - TS$, with $H = U + PV$;
$G$ is minimum at an equilibrium, and differences in
$G$ drive chemical reactions. The second is a score derived
from observed frequencies: denoting
$p_i(r)$ the probability of two atoms, defining type $i$,
to be located at distance $r$,
the (free) energy assigned to the pair is computed
as
$E_i(r) = -kT \log p_i(r)$.
Estimating from the PDB one
function $p_i(r)$ for each type of pair of atoms,
the energy of a complex is computed as the sum of the
energies of the pairs located within a distance threshold.
To compare the energy thus
obtained to a reference state, one may compute
$E = \sum_i p_i \log (p_i / q_i)$, with
$p_i$ the observed frequencies, and
$q_i$ the frequencies stemming from an a priori model.
In doing so, the energy defined
is nothing but the Kullback-Leibler divergence between the
distributions
$\{p_i\}$ and $\{q_i\}$.
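The two formulas above translate directly into code. The following sketch is an illustration only: the function names are hypothetical, the constant is the molar gas constant in kcal/(mol K) so that energies come out in kcal/mol, and the counts used in the usage lines are made-up numbers rather than statistics extracted from the PDB.

```python
import math

# Assumption: energies expressed per mole, so k is the gas constant in kcal/(mol*K).
K_KCAL_PER_MOL_K = 0.0019872041

def pair_energy(p: float, temperature: float = 300.0) -> float:
    """E_i(r) = -kT log p_i(r) for a single probability value p in (0, 1]."""
    return -K_KCAL_PER_MOL_K * temperature * math.log(p)

def normalise(counts):
    """Turn raw counts into frequencies summing to one."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """sum_i p_i log(p_i / q_i); terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    observed = normalise([12, 30, 45, 13])   # hypothetical pair-type counts
    reference = normalise([25, 25, 25, 25])  # hypothetical a priori model
    print(kl_divergence(observed, reference))
```

The divergence is zero exactly when the observed frequencies match the reference model, and positive otherwise, which is what makes it usable as a score relative to a reference state.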

**Methodological developments.** Describing interfaces
poses problems in two settings: static and dynamic.

In the static setting, one seeks the minimalist geometric model providing a relevant bio-physical signal. A first step in doing so consists of identifying interface atoms, so as to relate the geometry and the bio-chemistry at the interface level. To elaborate at the atomic level, one seeks a structural alphabet encoding the spatial structure of proteins. At the side-chain and backbone level, examples of such alphabets have been proposed. At the atomic level, and in spite of recent observations on the local structure of the neighborhood of a given atom, no such alphabet is known. Specific important local conformations are known, though. One of them is the so-called dehydron structure, which is an under-desolvated hydrogen bond, a property that can be directly inferred from the spatial configuration of the carbons surrounding a hydrogen bond.

A structural alphabet at the atomic level may be seen as
an alphabet featuring for an atom of a given type all the
conformations this atom may engage in, depending on its
neighbors. One way to tackle this problem consists of
extending the notions of molecular surfaces used so far, so
as to encode multi-body relations between an atom and its
neighbors. In order to derive such
alphabets, two strategies are natural. On one
hand, one may use an encoding of neighborhoods based on
geometric constructions such as Voronoi diagrams (affine or
curved) or arrangements of balls. On the other hand, one may
resort to clustering strategies in higher dimensional spaces,
as the $p$ neighbors of a given atom are represented by
$3p-6$ degrees of freedom, the
neighborhood being invariant upon rigid motions.

In the dynamic setting, one wishes to understand whether selected (hot) residues exhibit specific dynamic properties, so as to serve as anchors in a binding process. More generally, any significant observation raised in the static setting deserves investigations in the dynamic setting, so as to assess its stability. Such questions are also related to the problem of correlated motions, which we discuss next.

**Problems addressed.** Proteins in vivo vibrate at various
frequencies: high frequencies correspond to small amplitude
deformations of chemical bonds, while low frequencies
characterize more global deformations. This flexibility
contributes to the entropy, and thus to the
`free energy` of the system
*protein - solvent*. From the experimental standpoint,
NMR studies and Molecular Dynamics simulations generate
ensembles of conformations, called
`conformers`. Of particular interest while
investigating flexibility is the notion of correlated motion.
Intuitively, when a protein is folded, all atomic movements
must be correlated, a constraint which gets alleviated when
the protein unfolds since the steric constraints get relaxed
*diffusion - conformer selection - induced fit* complex
formation model.

Parameterizing these correlated motions, describing the corresponding energy landscapes, as well as handling collections of conformations pose challenging algorithmic problems.

**Methodological developments.** At the side-chain level,
the question of improving rotamer libraries is still of
interest. This question is essentially a
clustering problem in the parameter space describing the
side-chain conformations.

At the atomic level, flexibility is essentially investigated resorting to methods based on a classical potential energy (molecular dynamics), and (inverse) kinematics. A molecular dynamics simulation provides a point cloud sampling the conformational landscape of the molecular system investigated, as each step in the simulation corresponds to one point in the parameter space describing the system (the conformational space). The standard methodology to analyze such a point cloud consists of resorting to normal modes. Recently, though, more elaborate methods resorting to more local analysis, to Morse theory, and to analysis of meta-stable states of time series have been proposed.

Given a sampling on an energy landscape, a number of fundamental issues actually arise: how does the point cloud describe the topography of the energy landscape (a question reminiscent of Morse theory)? Can one infer the effective number of degrees of freedom of the system over the simulation, and is this number varying? Answers to these questions would be of major interest to refine our understanding of folding and docking, with applications to the prediction of structural properties. It should be noted in passing that such questions are probably related to modeling phase transitions in statistical physics, where geometric and topological methods are being used.

From an algorithmic standpoint, such
questions are reminiscent of
*shape learning*. Given a collection of samples on an
(unknown)
*model*,
*learning* consists of guessing the model from the
samples; the result of this process may be called the
*reconstruction*. In doing so, two types of guarantees
are sought: topologically speaking, the reconstruction and
the model should (ideally!) be isotopic; geometrically
speaking, their Hausdorff distance should be small. Motivated
by applications in Computer Aided Geometric Design, surface
reconstruction triggered a major activity in the
Computational Geometry community over the past ten years.
Aside from applications,
reconstruction raises a number of deep issues: the study of
distance functions to the model and to the samples, and their
comparison; the study of Morse-like
constructions stemming from distance functions to points;
the analysis of topological
invariants of the model and the samples, and their comparison.
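The geometric guarantee mentioned above, a small Hausdorff distance, is easy to state in code when both the model and the reconstruction are represented by finite point samples. The sketch below is a brute-force illustration with hypothetical function names; it is quadratic in the number of points and says nothing about the topological (isotopy) guarantee.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite, non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

The two directed distances differ in general: a reconstruction may stay close to every model point while still containing a spurious point far from the model, which only the second direction detects.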

Last but not least, gaining insight on such questions
would also help to effectively select a reduced set of
conformations best representing a larger number of
conformations. This selection problem is indeed faced by
flexible docking algorithms that need to maintain and/or
update collections of conformers for the second stage of the
*diffusion - conformer selection - induced fit* complex
formation model.
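One simple way to select a reduced set of representative conformations is greedy farthest-point selection, a classical heuristic for the k-center problem. It is shown here purely as an illustration and is not claimed to be the method used by any particular docking algorithm. The sketch assumes each conformer is a flat tuple of coordinates and compares them with the plain Euclidean distance; a realistic setting would rather use a distance such as RMSD after superposition, which can be passed through the `distance` parameter.

```python
import math

def select_representatives(conformers, k, distance=math.dist):
    """Greedy farthest-point selection; returns the indices of k representatives."""
    if not 0 < k <= len(conformers):
        raise ValueError("k must be between 1 and the number of conformers")
    chosen = [0]  # arbitrary seed: the first conformer
    # gap[i] = distance from conformer i to its closest chosen representative
    gap = [distance(c, conformers[0]) for c in conformers]
    while len(chosen) < k:
        # Pick the conformer that is worst represented so far.
        nxt = max(range(len(conformers)), key=gap.__getitem__)
        chosen.append(nxt)
        gap = [min(g, distance(c, conformers[nxt]))
               for g, c in zip(gap, conformers)]
    return chosen
```

Each iteration adds the conformer that is currently farthest from the selected set, so the representatives spread over the sampled region instead of concentrating in its densest part.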

In collaboration with S. Loriot, from the Geometry Factory.

Modeling the interfaces of macro-molecular complexes is key to improve our understanding of the stability and specificity of such interactions. We proposed a simple parameter-free model for macro-molecular interfaces, which enables a multi-scale investigation, from the atomic scale to the whole interface scale. As discussed in the associated publications, this interface model improves on the state-of-the-art to (i) identify interface atoms, (ii) define interface patches, (iii) assess the interface curvature, (iv) investigate correlations between the interface geometry and water dynamics / conservation patterns / polarity of residues.

The corresponding software,
*Intervor*, has been made available to the community
from the web site
http://

In collaboration with S. Loriot, from the Geometry Factory.

Molecular surfaces and volumes are paramount to molecular modeling, with applications to electrostatic and energy calculations, interface modeling, scoring and model evaluation, pocket and cavity detection, etc. However, for molecular models represented by collections of balls (Van der Waals and solvent accessible models), such calculations are challenging, in particular regarding numerics. Because all available programs overlook numerical issues, which in particular prevents them from qualifying the accuracy of the results returned, we developed the first certified algorithm. The corresponding piece of code uses so-called certified predicates to guarantee the branching operations of the program, as well as interval arithmetic to return an interval certified to contain the exact value of each statistic of interest, in particular the exact surface area and the exact volume of the molecular model processed. (As of December 2010, the corresponding publication is under revision.)
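The idea of returning an interval that is guaranteed to contain the exact value can be illustrated on a toy case: the volume of a single ball whose radius is an exactly representable floating-point number. The `Interval` class below is a hypothetical minimal sketch and not the code of the software described here. It widens every result outward by one unit in the last place, which encloses the rounding error of a single IEEE addition or multiplication, and it requires Python 3.9 or later for `math.nextafter`. Computing the volume of a union of overlapping balls needs considerably more machinery and is not attempted.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    @staticmethod
    def widened(lo: float, hi: float):
        """Enclose a rounded result by stepping one float outward on each side."""
        return Interval(math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval.widened(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        products = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval.widened(min(products), max(products))

    def contains(self, x: float) -> bool:
        return self.lo <= x <= self.hi

# Neither pi nor 4/3 is exactly representable, so both are enclosed.
PI = Interval.widened(math.pi, math.pi)
FOUR_THIRDS = Interval.widened(4.0 / 3.0, 4.0 / 3.0)

def ball_volume(radius: float) -> Interval:
    """Interval enclosing (4/3) * pi * radius**3 for an exact float radius."""
    r = Interval(radius, radius)
    return FOUR_THIRDS * PI * r * r * r
```

The returned interval is only a few units in the last place wide, so it qualifies the accuracy of the result instead of reporting a single number of unknown quality.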

The corresponding software,
*Vorlume*, has been made available to the community
from the web site
http://

In collaboration with N. Yanev, University of Sofia, and IMI at Bulgarian Academy of Sciences, Bulgaria, and R. Andonov, INRIA Rennes - Bretagne Atlantique, and IRISA/University of Rennes 1, France.

Structural similarity between proteins provides significant insights about their functions. Contact Map Overlap maximization (CMO) received sustained attention during the past decade and can be considered today as a credible measure of protein structure similarity. This paper presents A_purva, an exact CMO solver that is both efficient (notably faster than the previous exact algorithms), and reliable (providing accurate upper and lower bounds of the solution). These properties make it applicable for large-scale protein comparison and classification.
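To fix ideas on the quantity being maximized, the sketch below builds a contact map from one representative point per residue and counts the contacts shared under a given alignment. It only evaluates the objective for one alignment. It does not search for the best alignment, which is the hard problem an exact CMO solver addresses. The function names and the default distance threshold are assumptions made for the example.

```python
import math
from itertools import combinations

def contact_map(coords, threshold=7.5):
    """Set of index pairs (i, j), i < j, whose points lie within `threshold`."""
    return {(i, j) for i, j in combinations(range(len(coords)), 2)
            if math.dist(coords[i], coords[j]) <= threshold}

def overlap(contacts_a, contacts_b, alignment):
    """Number of contacts of A mapped onto contacts of B.

    `alignment` is a dict from residue indices of A to residue indices of B,
    assumed injective and order-preserving.
    """
    count = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            if (alignment[i], alignment[j]) in contacts_b:
                count += 1
    return count
```

Because the alignment preserves order, an aligned pair `(i, j)` with `i < j` maps to a pair in the same canonical order, so a plain set lookup suffices.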

The software is made available from
http://

Describing macro-molecular interfaces is key to improve our understanding of the specificity and of the stability of macro-molecular interactions, and also to predict complexes when little structural information is known. Ideally, an interface model should provide easy-to-compute geometric and topological parameters exhibiting a good correlation with important bio-physical quantities. It should also be parametric and amenable to comparisons. In this spirit, we developed an interface model based on Voronoi diagrams, which proved instrumental to refine state-of-the-art conclusions and provide new insights.

This work formally presents this Voronoi interface model. First, we discuss its connection to classical interface models based on distance cut-offs and solvent accessibility. Second, we develop the geometric and topological constructions underlying the Voronoi interface, and design efficient algorithms based on the Delaunay triangulation and the $\alpha$-complex.
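For comparison, the classical distance cut-off model mentioned above fits in a few lines. The sketch below implements that baseline only, not the Voronoi interface model, and the 5.0 cut-off is an arbitrary value chosen for the illustration. It uses a brute-force double loop, which is adequate for small examples but not for large-scale studies.

```python
import math

def interface_atoms(atoms_a, atoms_b, cutoff=5.0):
    """Indices of atoms of A and of B with a neighbor in the other partner within `cutoff`."""
    in_a, in_b = set(), set()
    for i, p in enumerate(atoms_a):
        for j, q in enumerate(atoms_b):
            if math.dist(p, q) <= cutoff:
                in_a.add(i)
                in_b.add(j)
    return sorted(in_a), sorted(in_b)
```

The result depends directly on the chosen cut-off, which is one motivation for a parameter-free alternative.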

We conclude with perspectives. In particular, we expect the Voronoi interface model to be particularly well suited for the problem of comparing interfaces in the context of large-scale structural studies.

Dealing with ambiguous data is a challenge in Science in general and geometry processing in particular. One route of choice to extract information from such data consists of replacing the ambiguous input by a continuum, typically a one-parameter family, so as to mine stable geometric and topological features within this family. This work follows this spirit and introduces a novel framework to handle 3D ambiguous geometric data which are naturally modeled by balls.

First, we introduce
*toleranced balls* to model ambiguous geometric
objects. A toleranced ball consists of two concentric
balls, and interpolating between their radii provides a way
to explore a range of possible geometries. We propose to
model an ambiguous shape by a collection of toleranced
balls, and show that the aforementioned radius
interpolation is tantamount to the growth process
associated with an additively-multiplicatively weighted
Voronoi diagram (also called compoundly weighted or CW).
Second and third, we investigate properties of the CW
diagram and the associated CW
$\alpha$-complex, which provides a filtration called the
$\lambda$-complex. Fourth, we sketch a naive algorithm to
compute the CW VD. Finally, we use the
$\lambda$-complex to assess the quality of models of large
protein assemblies, as these models inherently feature
ambiguities.
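A small sketch may help picture how stable features are mined in the one-parameter family. Assuming each toleranced ball is given as `(center, r_inner, r_outer)` with radius `r_inner + lam * (r_outer - r_inner)`, the code below computes the parameter at which two balls first touch and counts the connected components of the union at a given `lam` with a union-find structure. It is a brute-force illustration over pairs of balls with hypothetical function names, not the complex-based construction presented in this work.

```python
import math

def contact_lambda(ball_a, ball_b):
    """Smallest lam at which two growing toleranced balls touch."""
    (ca, ia, oa), (cb, ib, ob) = ball_a, ball_b
    growth = (oa - ia) + (ob - ib)  # assumed strictly positive
    return (math.dist(ca, cb) - ia - ib) / growth

def component_count(balls, lam):
    """Connected components of the union of balls at parameter lam >= 0."""
    parent = list(range(len(balls)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(balls)):
        for j in range(i + 1, len(balls)):
            if contact_lambda(balls[i], balls[j]) <= lam:
                parent[find(i)] = find(j)  # merge the two components
    return len({find(i) for i in range(len(balls))})
```

Sweeping `lam` and recording when the component count changes gives the interval over which each grouping of balls persists; groupings that persist over a long interval are the stable ones.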

This CGL project, see
http://

High dimensional geometric data are ubiquitous in science and engineering, and thus processing and analyzing them is a core task in these disciplines. The Computational Geometric Learning project (CG Learning or CGL) aims at extending the success story of geometric algorithms with guarantees, as achieved in the CGAL library and the related EU funded research projects, to spaces of high dimensions. This is not a straightforward task. For many problems, no efficient algorithms exist that compute the exact solution in high dimensions. This behavior is commonly called the curse of dimensionality.

We plan to address the curse of dimensionality by focusing on inherent structure in the data like sparsity or low intrinsic dimension, and by resorting to fast approximation algorithms. The following two kinds of approximation guarantee are particularly desirable: first, the solution approximates an objective better if more time and memory resources are employed (algorithmic guarantee), and second, the approximation gets better when the data become more dense and/or more accurate (learning theoretic guarantee). To lay the foundation of a new field—computational geometric learning—we will follow an approach integrating both theoretical and practical developments, the latter in the form of the construction of a high quality software library and application software.

In this context, the contribution of
ABS lies in the
work-package
*Modeling high-dimensional geometric structures in
science and engineering*, and is concerned with the
investigation of so-called
*energy landscapes*, which are hyper-surfaces
describing the energetic behavior of macro-molecular
systems as a function of conformational variables.

– F. Cazals was a member of the following program committees:

Symposium on Geometry Processing

International conference on Pattern Recognition in Bioinformatics

– F. Cazals acted as
*rapporteur* for the following PhD thesis defenses:

Duc Thanh Le, University of
Toulouse, October 2010,
*Rapporteur*. Thesis subject:
*(Dis)assembly path planning for complex objects and
applications to structural biology*. Advisors: T.
Siméon and J. Cortès.

Noël Malod-Dognin, Univ. of Rennes,
January 2010,
*Rapporteur*. Thesis subject:
*Protein structure comparison: From Contact Map
Overlap Maximisation to Distance-Based Alignment Search
Tool*. Advisor: R. Andonov.

Mathias Carlen, EPFL, January 2010,
*Rapporteur*. Thesis subject:
*Computation and Visualization of Ideal Knot
Shapes*. Advisor: J. Maddocks.

– F. Cazals has been appointed to the
scientific committee of
*GDR Bio-informatique-Moléculaire*, in charge of
activities related to computational structural biology.

F. Cazals is co-coordinator of the
*Master of Science in Computational Biology*,
University of Nice - Sophia-Antipolis. This master provides
an advanced curriculum at the interface of biology,
computer science and applied mathematics, and is geared
towards an international audience. See
http://

– Ecole Centrale Paris, 3rd year (master); Introduction to Computational Structural Biology; F. Cazals, 24h.

– University of Nice - Sophia-Antipolis, Master of Science in Computational Biology; Algorithmic Problems in Computational Structural Biology; F. Cazals, 24h.

Internship proposals can be seen on the web from the
*Positions*section at
http://

– Christine-Andrea Roth, MVA Cachan;
Master internship:
*Designing collective coordinates*; Advisor: F.
Cazals; Co-advisor: C. Robert, IBPC Paris.

– Muhammad Ammad Ud Din; MSc
Computational Biology, Univ. of Nice; Master internship:
*Modeling antibody / antigen binding patches*;
Advisor: F. Cazals.

– Achin Bansal, IIT Bombay; Summer
internship:
*Modeling protein binding patches*; Advisor: F.
Cazals.

– Palak Dalal, IIT Bombay; Summer
internship:
*Geometric optimization problems for collections of
balls*; Advisor: F. Cazals.

– Tom Dreyfus, University of Nice
Sophia-Antipolis; Topic:
*Modeling large macro-molecular assemblies*; Advisor:
F. Cazals.

– Christine-Andrea Roth, University of
Nice Sophia-Antipolis; Topic:
*Revisiting macro-molecular flexibility, with
applications to docking*; Advisor: F. Cazals.

F. Cazals gave the following invited talks:

*Balls, sticks, triangles and molecules*, closing
workshop of the ANR Triangles, Sophia-Antipolis,
December 2010.

*Assessing the stability of protein complexes within
large assemblies*, EMBO Symposium on Molecular
Perspectives on Protein-Protein Interactions, Sant
Feliu de Guixols, Spain, November 2010.

*Assessing the stability of protein complexes within
large assemblies*, XXIIeme Congrès de la Société
Française de Biophysique, La Colle sur Loup, September
2010.

*Assessing the stability of protein complexes within
large assemblies*, LAAS, Toulouse, September
2010.

*Geometric Models for the Description of 3D Molecular
Systems*, Energy Landscapes Workshop, Chemnitz, June
2010.

*Geometric Models for the Description of
High-dimensional Point Cloud Data*, Energy
Landscapes Workshop, Chemnitz, June 2010.

*Modeling the interface of protein - protein
complexes: shelling the Voronoi interface reveals
patterns of composition, residue conservation, and
water dynamics*, IBMC Strasbourg, Architecture et
réactivité de l'ARN, May 2010.

*Modeling water traffic at protein interfaces: from
Voronoi models to (simple) percolation on lattices*,
Ecole Normale Supérieure, groupe de travail
Probabilités et Statistiques, April 2010.

The ABS seminar featured presentations from the following visiting scientists:

– Charles Robert, Institut de Biologie Physico-Chimique, Paris.

– Annick Dejaegere, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Strasbourg, France.

– Patrick Schultz, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Strasbourg, France.

– Sameer Velankar, European Bioinformatics Institute, UK.

– Erik Aurell, Aalto University, Helsinki, Finland, and KTH Royal Institute of Technology, Stockholm, Sweden.