**Computational Biology and Computational Structural Biology.** Understanding the lineage between species and the genetic drift of genes and genomes, apprehending the control and feedback
loops governing the behavior of a cell, a tissue, an organ or a body, and inferring the relationship between the structure of biological (macro)-molecules and their functions are amongst the
major challenges of modern biology. The investigation of these challenges is supported by three types of data: genomic data, transcription and expression data, and structural data.

Genomic data feature sequences of nucleotides on DNA and RNA molecules, and are symbolic data whose processing falls in the realm of Theoretical Computer Science: dynamic programming, algorithms on texts and strings, graph theory dedicated to phylogenetic problems. Transcription and expression data feature evolving concentrations of molecules (RNAs, proteins, metabolites) over time, and fit in the formalism of discrete and continuous dynamical systems, and of graph theory. The exploration and the modeling of these data are covered by a rapidly expanding research field termed *systems biology*. Structural data encode information about the 3D structures of molecules (nucleic acids, proteins, small molecules) and their interactions, and come from three main sources: X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. Ultimately, structural data should expand our understanding of how the structure accounts for the function of macro-molecules, one of the central questions in structural biology. This goal actually subsumes two equally difficult challenges: *folding*, the process through which a protein adopts its 3D structure, and *docking*, the process through which two or several molecules assemble. Folding and docking are driven by non-covalent interactions and, for complex systems, are actually intertwined. Apart from the bio-physical interest raised by these processes, two application domains are concerned: in fundamental biology, one is primarily interested in understanding the machinery of the cell; in medicine, applications to drug design are developed.

**Modeling in Computational Structural Biology.** Acquiring structural data is not always possible: NMR is restricted to relatively small molecules; membrane proteins are notoriously difficult to crystallize, etc. As a matter of fact, while the order of magnitude of the number of genomes sequenced is one thousand, the Protein Data Bank contains (a mere) 45,000 structures. (Because one gene may yield a number of proteins through splicing, it is difficult to estimate the number of proteins from the number of genes. However, the latter is several orders of magnitude beyond the former.) For these reasons, *molecular modeling* is expected to play a key role in investigating structural issues.

Ideally, bio-physical models of macro-molecules should resort to quantum mechanics. While this is possible for small systems, say up to 50 atoms, large systems are investigated within the framework of the Born-Oppenheimer approximation, which stipulates that the nuclei and the electron cloud can be decoupled. Example force fields developed in this realm are AMBER, CHARMM, and OPLS. Of particular importance are van der Waals models, where each atom is modeled by a sphere whose radius depends on the chemical type of the atom. From a historical perspective, Richards, and later Connolly, while defining molecular surfaces and developing algorithms to compute them, established the connections between molecular modeling and geometric constructions. Remarkably, a number of difficult problems (e.g. additively weighted Voronoi diagrams) were touched upon in these early days.

The models developed in this vein are instrumental in investigating the interactions of molecules for which no structural data is available. But such models often fall short of providing complete answers, which we illustrate with the folding problem. On the one hand, as the conformations of side-chains belong to discrete sets (the so-called rotamers or rotational isomers), the number of distinct conformations of a polypeptide chain is exponential in the number of amino-acids. On the other hand, Nature folds proteins within time scales ranging from milliseconds to hours, which is out of reach for simulations. The fact that Nature avoids the exponential trap is known as Levinthal's paradox. The intrinsic difficulty of these problems calls for models exploiting several classes of information. For small systems, *ab initio* models can be built from first principles. But for more complex systems, *homology* or template-based models integrating a variable amount of knowledge acquired on similar systems are resorted to.
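
The exponential count behind Levinthal's paradox can be made concrete with a back-of-the-envelope computation. The sketch below assumes, purely for illustration, three conformations per residue and a sampling rate of 10^13 conformations per second; neither figure comes from the text above, and the function name is hypothetical.

```python
def levinthal_estimate(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Return (number of conformations, years needed to enumerate them all).

    states_per_residue and samples_per_second are illustrative assumptions.
    """
    n_conformations = states_per_residue ** n_residues  # exact Python integer
    seconds = n_conformations / samples_per_second
    years = seconds / (365.25 * 24 * 3600)
    return n_conformations, years


if __name__ == "__main__":
    count, years = levinthal_estimate(100)
    print(f"conformations for 100 residues: {count:.3e}")
    print(f"exhaustive enumeration: {years:.3e} years")
```

Under these assumptions a 100-residue chain already requires an enumeration time many orders of magnitude beyond the milliseconds-to-hours range quoted above, which is the gap the paradox points at.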

The variety of approaches developed is illustrated by the two community-wide experiments CASP (*Critical Assessment of Techniques for Protein Structure Prediction*; http://) and CAPRI (*Critical Assessment of Prediction of Interactions*; http://).

As illustrated by the previous discussion, modeling macro-molecules touches upon biology, physics and chemistry, as well as mathematics and computer science. In the following, we present the topics investigated within ABS.

The ABS project-team was created in July 2008.

The first PhD defense, that of S. Loriot, occurred on 12/02/2008.

F. Cazals was awarded a PhD grant by the Ministry of Research from the pool of 555 grants on
*Thématiques prioritaires.*

The research conducted by ABS focuses on two main directions in Computational Structural Biology (CSB), each such direction calling for specific algorithmic developments. These directions are:

- Modeling interfaces and contacts,

- Modeling the flexibility of macro-molecules.

**Problems addressed.** The Protein Data Bank,
http://
*interacting* with atoms of the second one. Understanding the structure of interfaces is central to understanding biological complexes and thus the function of biological molecules. Yet, in spite of almost three decades of investigations, the basic principles guiding the formation of interfaces and accounting for their stability are unknown. Current investigations follow two routes. From the experimental perspective, directed mutagenesis allows one to quantify the energetic importance of residues, important residues being termed *hot* residues. Such studies recently evidenced the *modular* architecture of interfaces. From the modeling perspective, the main issue consists of guessing the hot residues from sequence and/or structural information.

The description of interfaces is also of special interest to improve *scoring functions*. By scoring function, two things are meant: either a function which assigns to a complex a quantity homogeneous to a free energy change $G = H - TS$, with $H = U + PV$. $G$ is minimum at an equilibrium, and differences in $G$ drive chemical reactions. [...] Denoting by $p_i(r)$ the probability of two atoms, defining type $i$, to be located at distance $r$, the (free) energy assigned to the pair is computed as $E_i(r) = -kT \log p_i(r)$. Estimating from the PDB one function $p_i(r)$ for each type of pair of atoms, the energy of a complex is computed as the sum of the energies of the pairs located within a distance threshold. To compare the energy thus obtained to a reference state, one may compute $\sum_i p_i \log (p_i / q_i)$, with $p_i$ the observed frequencies, and $q_i$ the frequencies stemming from an a priori model. In doing so, the energy defined is nothing but the Kullback-Leibler divergence between the distributions $\{p_i\}$ and $\{q_i\}$.
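
The two quantities just described, the pair energy obtained by reverting Boltzmann's law and the Kullback-Leibler comparison to a reference state, can be sketched in a few lines. This is a minimal illustration, not the scoring function of any particular package: the bin counts are made up, the temperature of 300 K is an assumption, and the function names are hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def normalize(counts):
    """Turn raw bin counts into frequencies summing to one."""
    total = float(sum(counts))
    return [c / total for c in counts]


def pair_energy(p, temperature=300.0):
    """Inverse Boltzmann law: E = -kT log p for an observed probability p > 0."""
    return -K_B * temperature * math.log(p)


def kl_divergence(p, q):
    """Sum_i p_i log(p_i / q_i); bins with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical counts of one pair type over four distance bins.
observed_counts = [5, 40, 30, 25]
reference_counts = [25, 25, 25, 25]  # assumed a priori model: uniform

p = normalize(observed_counts)
q = normalize(reference_counts)
energies = [pair_energy(pi) for pi in p]  # one energy per distance bin
score = kl_divergence(p, q)               # comparison to the reference state
```

Frequently observed distances receive a lower energy, and the divergence vanishes exactly when the observed frequencies coincide with the reference ones.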

**Methodological developments.** Describing interfaces poses problems in two settings: static and dynamic.

In the static setting, one seeks the minimalist geometric model providing a relevant bio-physical signal. A first step in doing so consists of identifying interface atoms, so as to relate the geometry and the bio-chemistry at the interface level. To elaborate at the atomic level, one seeks a structural alphabet encoding the spatial structure of proteins. At the side-chain and backbone level, such alphabets have been proposed. At the atomic level, and in spite of recent observations on the local structure of the neighborhood of a given atom, no such alphabet is known. Specific important local conformations are known, though. One of them is the so-called dehydron structure, which is an under-desolvated hydrogen bond, a property that can be directly inferred from the spatial configuration of the carbons surrounding a hydrogen bond.

A structural alphabet at the atomic level may be seen as an alphabet featuring, for an atom of a given type, all the conformations this atom may engage into, depending on its neighbors. One way to tackle this problem consists of extending the notions of molecular surfaces used so far, so as to encode multi-body relations between an atom and its neighbors. In order to derive such alphabets, two strategies are natural. On the one hand, one may use an encoding of neighborhoods based on geometric constructions such as Voronoï diagrams (affine or curved) or arrangements of balls. On the other hand, one may resort to clustering strategies in higher dimensional spaces, as the $p$ neighbors of a given atom are represented by $3p-6$ degrees of freedom, the neighborhood being invariant upon rigid motions.
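
The invariance upon rigid motions can be illustrated with a simple descriptor. The sketch below uses the sorted list of pairwise distances between the neighbors, which is one possible invariant encoding chosen here for illustration only (it ignores atom types and chirality); the coordinates and function names are hypothetical. For four points, the six pairwise distances match the $3p-6$ count.

```python
import itertools
import math


def distance_signature(points):
    """Sorted pairwise distances: unchanged by rotations and translations."""
    return sorted(math.dist(a, b) for a, b in itertools.combinations(points, 2))


def rigid_motion(points, angle, shift):
    """Rotate by `angle` (radians) about the z axis, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy, dz = shift
    return [(c * x - s * y + dx, s * x + c * y + dy, z + dz) for x, y, z in points]


# Hypothetical neighborhood of p = 4 atoms.
neighbors = [(1.5, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 2.5)]
moved = rigid_motion(neighbors, angle=0.7, shift=(3.0, -2.0, 5.0))

before = distance_signature(neighbors)
after = distance_signature(moved)
same = all(abs(a - b) < 1e-9 for a, b in zip(before, after))
```

Clustering such signatures, rather than raw coordinates, avoids treating two copies of the same neighborhood in different orientations as different conformations.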

In the dynamic setting, one wishes to understand whether selected (hot) residues exhibit specific dynamic properties, so as to serve as anchors in a binding process. More generally, any significant observation raised in the static setting deserves investigations in the dynamic setting, so as to assess its stability. Such questions are also related to the problem of correlated motions, which we discuss next.

**Problems addressed.** Proteins in vivo vibrate at various frequencies: high frequencies correspond to small amplitude deformations of chemical bonds, while low frequencies characterize more global deformations. This flexibility contributes to the entropy and thus to the free energy of the *protein - solvent* system. From the experimental standpoint, NMR studies and Molecular Dynamics simulations generate ensembles of conformations, called *conformers*. Of particular interest while investigating flexibility is the notion of correlated motion. Intuitively, when a protein is folded, all atomic movements must be correlated, a constraint which gets alleviated when the protein unfolds since the steric constraints get relaxed. [...] *diffusion - conformer selection - induced fit* complex formation model.

Parameterizing these correlated motions, describing the corresponding energy landscapes, as well as handling collections of conformations pose challenging algorithmic problems.

**Methodological developments.** At the side-chain level, the question of improving rotamer libraries is still of interest. This question is essentially a clustering problem in the parameter space describing the side-chain conformations.

At the atomic level, flexibility is essentially investigated resorting to methods based on a classical potential energy (molecular dynamics), and (inverse) kinematics. A molecular dynamics simulation provides a point cloud sampling the conformational landscape of the molecular system investigated, as each step in the simulation corresponds to one point in the parameter space describing the system (the conformational space). The standard methodology to analyze such a point cloud consists of resorting to normal modes. Recently, though, more elaborate methods resorting to more local analysis, to Morse theory, and to the analysis of meta-stable states of time series have been proposed.

Given a sampling on an energy landscape, a number of fundamental issues actually arise: how does the point cloud describe the topography of the energy landscape (a question reminiscent of Morse theory)? Can one infer the effective number of degrees of freedom of the system over the simulation, and is this number varying? Answers to these questions would be of major interest to refine our understanding of folding and docking, with applications to the prediction of structural properties. It should be noticed in passing that such questions are probably related to modeling phase transitions in statistical physics, where geometric and topological methods are being used.

From an algorithmic standpoint, such questions are reminiscent of *shape learning*. Given a collection of samples on an (unknown) *model*, *learning* consists of guessing the model from the samples; the result of this process may be called the *reconstruction*. In doing so, two types of guarantees are sought: topologically speaking, the reconstruction and the model should (ideally!) be isotopic; geometrically speaking, their Hausdorff distance should be small. Motivated by applications in CAGD, surface reconstruction triggered a major activity in the Computational Geometry community over the past ten years. Aside from applications, reconstruction raises a number of deep issues: the study of distance functions to the model and to the samples, and their comparison; the study of Morse-like constructions stemming from distance functions to points; the analysis of topological invariants of the model and the samples, and their comparison.
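
The geometric guarantee mentioned above relies on the Hausdorff distance. For two finite point sets it can be computed by brute force, as in the following sketch; the circle used as "model" and its sub-sampling are illustrative assumptions, and the quadratic-time loop is only meant for small inputs.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(x, y) for y in b) for x in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical model: 360 points on the unit circle; samples: every 30th point.
model = [(math.cos(math.radians(k)), math.sin(math.radians(k))) for k in range(360)]
samples = model[::30]

h = hausdorff(model, samples)
```

Since the samples are taken on the model, the directed distance from samples to model is zero, and the value of `h` measures how far the model strays from the samples.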

Last but not least, gaining insight on such questions would also help to effectively select a reduced set of conformations best representing a larger number of conformations. This selection problem is indeed faced by flexible docking algorithms that need to maintain and/or update collections of conformers for the second stage of the *diffusion - conformer selection - induced fit* complex formation model.

We recently proposed a model of (macro-)molecular interfaces based upon power diagrams. The corresponding software, *Intervor*, has been made available to the community from the web site
http://
*Protein Science*, and the server has been used about 1000 times since then. To the best of our knowledge, this code is the only publicly available one for analyzing (Voronoi) interfaces
in complexes.

Following the recent strategy we developed to identify biological and crystallographic contacts using support vector machines, we made a web server available for that purpose. The DiMoVo server
http://

Available online in 2007, the VorScore server
http://

In collaboration with L. Rineau and S. Pion,
Geometrica. Work started by Nicolas Carrez, summer intern, 2005.
http://

CGAL is a C++ library of geometric algorithms initially developed within two European projects (project ESPRIT IV LTR CGAL, December 97 - June 98; project ESPRIT IV LTR GALIA, November 99 - August 00) by a consortium of eight research teams from the following institutes: Universiteit Utrecht, Max-Planck Institut Saarbrücken, INRIA Sophia Antipolis, ETH Zürich, Tel Aviv University, Freie Universität Berlin, Universität Halle, RISC Linz. The goal of CGAL is to make the solutions offered by the computational geometry community available to the industrial world and applied domains.

The IPE editor, see
http://

Based on the 2D algorithms present in the CGAL library, we developed in C++ a set of plugins, so as to make the following algorithms available from IPE: triangulations (Delaunay, constrained Delaunay, regular) as well as their duals, a convex hull algorithm, polygon partitioning algorithms, polygon offset, arrangements of linear and degree two primitives. These plugins are available under the Open Source LGPL license, and are subject to the constraints of the underlying CGAL packages. They can be downloaded from
http://


B. Bouvier is with Institut de Biologie et de Chimie des Protéines, CNRS/Lyon Univ., France; R. Grünberg is with EMBL-CRG Systems Biology Unit, Barcelona, Spain; M. Nilges is with Unité de Bioinformatique Structurale, Institut Pasteur Paris, France.

The accurate description and analysis of protein-protein interfaces remains a challenging task. Traditional definitions, based on atomic contacts or changes in solvent accessibility, tend to over- or underpredict the interface itself and cannot discriminate active from less relevant parts. This paper introduces a fast, parameter-free and purely geometric definition of protein interfaces, and introduces the shelling order of Voronoi facets as a novel measure for an atom's depth inside the interface. Our analysis of 54 protein-protein complexes reveals a strong correlation between Voronoi Shelling Order (VSO) and water dynamics. High Voronoi Shelling Order coincides with residues that were found shielded from bulk water fluctuations in a recent molecular dynamics study. Yet, VSO predicts such “dry” residues without consideration of forcefields or dynamics, at dramatically reduced cost. More central interface positions are often also increasingly enriched for hydrophobic residues. Yet, this hydrophobic centering is not universal and does not mirror the far stronger geometric bias of water fluxes. The seemingly complex water dynamics at protein interfaces thus appears largely controlled by geometry. Sequence analysis supports the functional relevance of both dry residues and residues with high VSO, both of which tend to be more conserved. Upon closer inspection, the spatial distribution of conservation argues against the arbitrary dissection into core or rim and thus refines previous results. Voronoi Shelling Order reveals clear geometric patterns in protein interface composition, function and dynamics, and facilitates the comparative analysis of protein-protein interactions.
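
To give a feel for what a shelling order on interface facets may look like, here is a minimal sketch under a stated assumption: facets on the boundary of the interface receive order 1, and every other facet receives one plus the smallest order among its adjacent facets, which amounts to a multi-source breadth-first search. This reading is an assumption made for illustration and is not taken from the abstract above; the facet names, the adjacency dictionary and the function name are hypothetical.

```python
from collections import deque


def shelling_order(adjacency, boundary):
    """Multi-source BFS depth of each facet, boundary facets having order 1.

    adjacency: dict mapping a facet to the facets sharing an edge with it.
    boundary:  iterable of facets lying on the rim of the interface.
    """
    order = {facet: 1 for facet in boundary}
    queue = deque(order)
    while queue:
        facet = queue.popleft()
        for neighbor in adjacency[facet]:
            if neighbor not in order:
                order[neighbor] = order[facet] + 1
                queue.append(neighbor)
    return order


# Hypothetical strip of five facets, whose two end facets touch the rim.
adjacency = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
orders = shelling_order(adjacency, boundary=["a", "e"])
```

A per-atom depth could then be derived from the orders of the facets an atom contributes to, for instance by taking their minimum; that aggregation rule is again an assumption.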


R. Bahadur is Postdoctoral Scholar at the Jacobs University of Bremen, Germany; F. Rodier is retired from LEBS in CNRS Gif-sur-Yvette, France; J. Janin is Emeritus Professor at the Université Paris-Sud, Orsay, France; A. Poupon is in the Physiologie de la Reproduction et des Comportements lab, INRA Tours.

Knowledge of the oligomeric state of a protein is often essential for understanding its function and mechanism. Within a protein crystal, each protein monomer is in contact with many others, forming many small interfaces and a few larger ones that are biologically significant if the protein is a homodimer in solution, but not if the protein is monomeric. Telling such "crystal dimers" from real ones remains a difficult task.

It has already been demonstrated that the interfaces of native and non-native protein-protein complexes can be distinguished using a combination of parameters computed with a method based on the Voronoi tessellation. We show in our study that the same parameters highlight significant differences between the interfaces of biological and crystal dimers. Using these parameters as descriptors in machine learning methods leads to accurate classification of specific and non-specific protein-protein interfaces.


S. Sachdeva is currently a PhD student at Princeton University; K. Bastard is with Biotechnologie-Biocatalyse-Biorégulation, Université de Nantes - CNRS, France; C. Prévot is with Institut de Biologie Physico-Chimique, Paris, France.

To address challenging flexible docking problems, a number of docking algorithms pre-generate large collections of candidate conformers. To further remove the redundancy from such ensembles, a central question in this context is the following one: report a selection of conformers maximizing some geometric diversity criterion. In this context, this paper makes three contributions.

First, we tackle this problem resorting to geometric optimization so as to report selections maximizing the molecular volume or molecular surface area (MSA) of the selection. Greedy strategies are developed, together with approximation bounds.

Second, to assess the efficacy of our algorithms, we investigate two conformer ensembles corresponding to a flexible loop of four protein complexes. By focusing on the MSA of the selection, we show that our strategy matches the MSA of standard selection methods, but resorting to a number of conformers between one and two orders of magnitude smaller. This observation is qualitatively explained using the Betti numbers of the union of balls of the selection.

Finally, we put the conformer selection problem back in the context of multiple-copy flexible docking. On the systems above, we show that using the loops selected by our strategy can significantly improve the result of the docking process.
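
The greedy selection idea can be sketched as follows. This is not the algorithm of the paper: as a stand-in for the molecular volume, the sketch counts the grid voxels whose centers fall inside the union of balls of the selected conformers, and at each step picks the conformer with the largest marginal gain. The ball radius, the grid spacing, the toy conformers and all function names are assumptions.

```python
import itertools
import math


def voxels_of_ball(center, radius, h):
    """Grid cells (integer triples) whose centers lie inside the ball."""
    cx, cy, cz = center
    lo = [math.floor((c - radius) / h) for c in center]
    hi = [math.ceil((c + radius) / h) for c in center]
    cells = set()
    for i, j, k in itertools.product(*(range(l, u + 1) for l, u in zip(lo, hi))):
        x, y, z = (i + 0.5) * h, (j + 0.5) * h, (k + 0.5) * h
        if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= radius ** 2:
            cells.add((i, j, k))
    return cells


def voxels_of_conformer(atoms, radius, h):
    """Union of the voxel footprints of all atoms of one conformer."""
    cells = set()
    for center in atoms:
        cells |= voxels_of_ball(center, radius, h)
    return cells


def greedy_selection(conformers, k, radius=1.8, h=0.5):
    """Pick up to k conformers, each maximizing the gain in covered voxels."""
    footprints = [voxels_of_conformer(c, radius, h) for c in conformers]
    covered, chosen = set(), []
    for _ in range(min(k, len(conformers))):
        candidates = (i for i in range(len(conformers)) if i not in chosen)
        best = max(candidates, key=lambda i: len(footprints[i] - covered))
        chosen.append(best)
        covered |= footprints[best]
    return chosen, len(covered) * h ** 3  # indices, approximate volume


# Toy input: three one-atom "conformers", two of them nearly identical.
conformers = [[(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
chosen, volume = greedy_selection(conformers, k=2)
```

On this toy input the selection keeps only one of the two nearly identical conformers and adds the distant one, which is the redundancy-removal behavior sought. Coverage functions of this kind are monotone and submodular, a setting in which greedy selection is classically known to achieve a 1 - 1/e approximation ratio; whether this matches the bounds of the paper is not claimed here.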

F. Chazal is with INRIA Saclay - Geometrica; J. Giesen is Professor at Jena University.

Life sciences, engineering, or telecommunications provide numerous systems whose description requires a large number of variables. Developing insights into such systems, forecasting their evolution, or monitoring them is often based on the inference of correlations between these variables. Given a collection of points describing states of the system, questions such as inferring the effective number of independent parameters of the system (its intrinsic dimensionality) and the way these are coupled are paramount to develop models. In this context, this paper makes two contributions.

First, we review recent work on spectral techniques to organize point clouds in Euclidean space, with emphasis on the main difficulties faced. Second, after a careful presentation of the bio-physical context, we present applications of dimensionality reduction techniques to a core problem in structural biology, namely protein folding.

Both from the computer science and the structural biology perspective, we expect this survey to shed new light on the importance of *non linear computational geometry* in geometric data analysis in general, and for protein folding in particular.

S. Pion is with INRIA Geometrica (Sophia-Antipolis), and A. Parameswaran is currently a PhD student at Stanford University.

The Delaunay triangulation and its dual the Voronoi diagram are ubiquitous geometric complexes. From a topological standpoint, the connection has recently been made between these cell complexes and the Morse theory of distance functions. In particular, in the generic setting, algorithms have been proposed to compute the flow complex —the stable and unstable manifolds associated to the critical points of the distance function to a point set. As algorithms ignoring degenerate cases and numerical issues are bound to fail on general inputs, this paper develops the first complete and robust algorithm to compute the flow complex.

First, we present complete algorithms for the flow operator, unraveling a delicate interplay between the degenerate cases of Delaunay and those which are flow specific. Second, we sketch how the flow operator unifies the construction of stable and unstable manifolds. Third, we discuss numerical issues related to predicates on cascaded constructions. Finally, we report experimental results with CGAL's filtered kernel, showing that the construction of the flow complex incurs a small overhead w.r.t. the Delaunay triangulation when moderate cascading occurs. These observations provide important insights on the relevance of the flow complex for (surface) reconstruction and medial axis approximation, and should foster flow complex based algorithms.

In a broader perspective and to the best of our knowledge, this paper is the first one reporting on the effective implementation of a geometric algorithm featuring cascading.

C. Karande is currently a PhD student at Georgia Tech.

Reporting the maximal cliques of a graph is a fundamental problem arising in many areas. This note bridges the gap between three papers addressing this problem: the original paper of Bron-Kerbosch (Comm. of the ACM, 1973), and two papers recently published in *Theoretical Computer Science*, namely that of Tomita et al. (*Theoretical Computer Science* 363, 2006), and that of Koch (*Theoretical Computer Science* 250, 2001). In particular, we show that the strategy of Tomita et al. is a simple modification of the Bron-Kerbosch algorithm, based on an (un-exploited) observation raised in Koch's paper.
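
For readers unfamiliar with these algorithms, the following sketch shows the textbook Bron-Kerbosch recursion with a pivot chosen to maximize the number of candidates adjacent to it, which is the pivot rule commonly attributed to Tomita et al. It is a generic illustration and does not reproduce the presentation of the note; the graph and the function name are hypothetical.

```python
def maximal_cliques(adj):
    """Yield the maximal cliques of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbors.
    """
    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed vertices.
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot u in P union X maximizing |P & N(u)|; neighbors of u are skipped.
        pivot = max(p | x, key=lambda u: len(p & adj[u]))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Triangle 0-1-2 with a pendant edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = set(maximal_cliques(graph))
```

Without the pivot, the loop would run over every candidate in `p`; restricting it to the non-neighbors of the pivot prunes branches that cannot produce a new maximal clique.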

The France-Stanford Center for Interdisciplinary Studies is funding a two-year project (2007-08) entitled
*Developments of Geometric Methods and Algorithms for the study of macro molecular assemblies*. The PIs are F. Cazals (INRIA) and M. Levitt (chair of the Structural Biology Dpt, Stanford
University). The goal of the project is to make a stride towards improved multi-scale modeling of large protein complexes.

– Frédéric Cazals was member of the paper committees of the Eurographics Symposium on Geometry Processing'08, of the ACM Symposium on Solid and Physical Modeling'08, of the International Conference on Pattern Recognition in Bioinformatics'08, of the International Symposium on 3D Data Processing, Visualization, and Transmission'08, and of the Symposium on Point-Based Graphics'08.

– Frédéric Cazals was reviewer for the Ph.D. thesis of Christine Martin (Univ. Orsay).

– Frédéric Cazals was a member of two committees at Université d'Evry-Val-d'Essonne, for the recruitment of one *Professor* and one *Maître de conférences* in CS/Bioinformatics.

Modelling in structural biology is a topic of interest for a number of groups around Sophia-Antipolis and Nice, both in academia (CNRS, université de Nice-Sophia-Antipolis, INRA, INRIA) and in industry (in particular Galderma, one of the worldwide leading dermatology companies). Researchers from these organizations meet for half a day once every trimester, to attend two talks on topics of general interest in this realm. The organization of these meetings for the academic year 2008-09 is handled by ABS.

The ABS web server has been up and running since July 2008.

- Master Bioinformatique et Biostatistiques (BIBS), Orsay University; Algorithmic Problems in Computational Structural Biology; F. Cazals (12h), J. Janin (6h), C. Robert (6h).

- Cursus Ecole Normale Supérieure de Lyon à Sophia-Antipolis; Topics in Structural Biology; F. Cazals (16h), J. Bernauer (8h).

- AgroParisTech, Paris, MAP3 (module d'approfondissement) Ingénierie des protéines, cursus ingénieur agronome, deuxième année; Introduction à la bioinformatique; J. Bernauer (3h).

- Polytech'Nice-Sophia, troisième année, Option Bio-Informatique et Modélisation pour la Biologie; Biogeometry; J. Bernauer (3h).

Internship proposals can be seen on the web from the
*Positions*section at
http://

– Aaditya Ramdas,
*Landmark based dimensionality reduction*, IIT Bombay.

– Nicolas Bonifas,
*Comparing Voronoi interfaces of protein-protein complexes*, ENS Lyon.

– Tom Dreyfus,
*Modeling large macro-molecular assemblies*, université de Nice Sophia-Antipolis.

– Sébastien Loriot,
*Arrangements de Cercles sur une Sphère: Algorithmes et Applications aux Modèles Moléculaires Representés par une Union de Boules*, université de Bourgogne. Defended on 12/02 in front of
the following committee: R. Lavery (Univ. Lyon - CNRS; rapporteur); J. Snoeyink (Univ. North Carolina, rapporteur); John Maddock (Ecole Polytechnique Fédérale de Lausanne; examinateur); F.
Chazal (INRIA Saclay - Geometrica; co-advisor); F. Cazals (INRIA Sophia-Antipolis - ABS, co-advisor).

Members of the project have presented their published articles at conferences. The reader can refer to the bibliography to obtain the corresponding list. We list below all other talks given in seminars or summer schools.

–
*Modeling protein - protein interactions: the geometry of active residues and the selection of conformers*; IRISA, Rennes, 05/08. F. Cazals.

–
*Protein folding: energy landscapes, spectral analysis, and Morse theory*; Journées de Dynamique Non Linéaire, Marseille; 01/08. F. Cazals.

–
*Describing protein-protein and atomic environments: a geometric perspective*; (i) Séminaire Mathématiques Appliquées à la Génomique: Modèles et Algorithmes, Marseille, 01/08; (ii)
Université de Nice, séminaire du Dpt de Mathématiques, 01/08. F. Cazals.

–
*Protein - Protein interactions: towards improved predictions*; Sanofi - Aventis research seminar, 12/08.

–
*Machine learning and protein docking scoring functions*; EMBO Workshop "Docking Predictions of Protein-Protein Interaction", Barcelona, 14-17 October 08. J. Bernauer.

The ABS seminar series started in 2008 and featured presentations from the following visiting scientists:

– M. Levitt, Stanford University, USA;

– S. Dokudovskaya, Inst. Jacques Monod, Paris, France;

– H-X. Zhou, Florida State University, USA;

– S. Flores, Stanford University, USA.

ABS has hosted the following scientists:

– Dahlia Weiss, visiting PhD student from Stanford University, from 01/30/08 to 06/16/08.