Computational Biology and Computational Structural Biology. Understanding the lineage between species and the genetic drift of genes and genomes, apprehending the control and feedback loops governing the behavior of a cell, a tissue, an organ or a body, and inferring the relationship between the structure of biological (macro-)molecules and their functions are amongst the major challenges of modern biology. The investigation of these challenges is supported by three types of data: genomic data, transcription and expression data, and structural data.
Genetic data feature sequences of nucleotides on DNA and RNA molecules, and are symbolic data whose processing falls in the realm of Theoretical Computer Science: dynamic programming,
algorithms on texts and strings, graph theory dedicated to phylogenetic problems. Transcription and expression data feature evolving concentrations of molecules (RNAs, proteins, metabolites)
over time, and fit in the formalism of discrete and continuous dynamical systems, and of graph theory. The exploration and the modeling of these data are covered by a rapidly expanding research
field termed systems biology. Structural data encode information about the 3D structures of molecules (nucleic acids, proteins, small molecules) and their interactions, and come from three main sources: X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. Ultimately, structural data should expand our understanding of how the structure accounts for the function of macro-molecules, one of the central questions in structural biology. This goal actually subsumes two equally difficult challenges: folding, the process through which a protein adopts its 3D structure, and docking, the process through which two or several molecules assemble. Folding and docking are driven by non-covalent interactions and, for complex systems, are actually intertwined. Apart from the bio-physical interest raised by these processes, two different application domains are concerned: in fundamental biology, one is primarily interested in understanding the machinery of the cell; in medicine, applications to drug design are developed.
Modeling in Computational Structural Biology. Acquiring structural data is not always possible: NMR is restricted to relatively small molecules; membrane proteins are notoriously difficult to crystallize, etc. As a matter of fact, as of October 2007, about 1,000 genomes have been fully sequenced or are about to be so, while the Protein Data Bank contains (a mere) 40,000 structures. For these reasons, molecular modeling is expected to play a key role in investigating structural issues.
Ideally, bio-physical models of macro-molecules should resort to quantum mechanics. While this is possible for small systems, say up to 50 atoms, large systems are investigated within the framework of the Born-Oppenheimer approximation, which stipulates that the nuclei and the electron cloud can be decoupled. Example force fields developed in this realm are AMBER, CHARMM and OPLS. Of particular importance are Van der Waals models, where each atom is modeled by a sphere whose radius depends on the chemical type of the atom. From a historical perspective, Richards, and later Connolly, while defining molecular surfaces and developing algorithms to compute them, established the connections between molecular modeling and geometric constructions. Remarkably, a number of difficult problems (e.g. additively weighted Voronoi diagrams) were touched upon in these early days.
The models developed in this vein are instrumental in investigating the interactions of molecules for which no structural data is available. But such models often fall short of providing complete answers, which we illustrate with the folding problem. On one hand, as the conformations of side-chains belong to discrete sets (the so-called rotamers or rotational isomers), the number of distinct conformations of a polypeptide chain is exponential in the number of amino-acids. On the other hand, Nature folds proteins within time scales ranging from milliseconds to hours, which is out of reach for simulations. The fact that Nature avoids the exponential trap is known as Levinthal's paradox. The intrinsic difficulty of these problems calls for models exploiting several classes of information. For small systems, ab initio models can be built from first principles. But for more complex systems, one resorts to homology or template-based models integrating a variable amount of knowledge acquired on similar systems.
The variety of approaches developed is illustrated by the two community-wide experiments CASP (Critical Assessment of Techniques for Protein Structure Prediction;
http://
As illustrated by the previous discussion, modeling macro-molecules touches upon biology, physics and chemistry, as well as mathematics and computer science. In the following, we present the topics investigated within ABS.
The ABS team was created in July 2007. Julie Bernauer, currently a postdoc with Michael Levitt in the Structural Biology Dpt at Stanford University, will take up her position in December 2007 (she was recruited as CR2 in June 2007). Following this postdoc position, a project funded by the France-Stanford Center for Interdisciplinary Studies was accepted, the principal investigators being F. Cazals, M. Levitt, and J. Bernauer.
The research conducted by ABS focuses on two main directions in Computational Structural Biology (CSB), each such direction calling for specific algorithmic developments. These directions are:
- Modeling interfaces and contacts,
- Modeling the flexibility of macro-molecules.
Problems addressed. The Protein Data Bank,
http://
The description of interfaces is also of special interest to improve scoring functions. By scoring function, two things are meant, one of them being a function which assigns to a complex a quantity homogeneous to a free energy change. (The free energy of a system is $G = H - TS$, with $H = U + PV$; $G$ is minimum at an equilibrium, and differences in $G$ drive chemical reactions.) Denoting by $p_i(r)$ the probability of two atoms, defining type $i$, to be located at distance $r$, the (free) energy assigned to the pair is computed as $E_i(r) = -kT \log p_i(r)$. Estimating from the PDB one function $p_i(r)$ for each type of pair of atoms, the energy of a complex is computed as the sum of the energies of the pairs located within a distance threshold. To compare the energy thus obtained to a reference state, one may compute $E = \sum_i p_i \log \frac{p_i}{q_i}$, with $p_i$ the observed frequencies, and $q_i$ the frequencies stemming from an a priori model. In doing so, the energy defined is nothing but the Kullback-Leibler divergence between the distributions $\{p_i\}$ and $\{q_i\}$.
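The two quantities just described can be illustrated with a minimal Python sketch. It assumes NumPy, uses a plain histogram as the estimate of the pair distance distribution, and sets $kT = 1$ by default; the function names `pair_potential` and `kl_divergence` are hypothetical and chosen for illustration only, and this is not the procedure used by any particular scoring function.

```python
import numpy as np

def pair_potential(distances, bins, kT=1.0):
    """Inverse-Boltzmann energy E(r) = -kT log p(r) for one type of atom pair.

    `distances` are observed pair distances; `bins` are the histogram edges.
    Empty bins receive an infinite energy.
    """
    counts, edges = np.histogram(distances, bins=bins)
    p = counts / counts.sum()
    with np.errstate(divide="ignore"):
        energy = -kT * np.log(p)
    return edges, p, energy

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_i p_i log(p_i / q_i).

    Terms with p_i = 0 contribute zero; q_i is assumed positive wherever p_i is.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy usage: four observed distances, two distance bins.
edges, p, energy = pair_potential([1.0, 1.0, 1.0, 3.0], bins=[0.0, 2.0, 4.0])
score = kl_divergence(p, [0.5, 0.5])  # comparison to a uniform reference model
```

In practice the reference frequencies $q_i$ depend on the a priori model retained, which the sketch leaves to the caller.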
Methodological developments. Describing interfaces poses problems in two settings: static and dynamic.
In the static setting, one seeks the minimalist geometric model providing a relevant bio-physical signal. A first step in doing so consists of identifying interface atoms, so as to relate the geometry and the bio-chemistry at the interface level. To elaborate at the atomic level, one seeks a structural alphabet encoding the spatial structure of proteins. At the side-chain and backbone level, such alphabets have been proposed. At the atomic level, and in spite of recent observations on the local structure of the neighborhood of a given atom, no such alphabet is known. Specific important local conformations are known, though. One of them is the so-called dehydron structure, which is an under-desolvated hydrogen bond, a property that can be directly inferred from the spatial configuration of the carbons surrounding a hydrogen bond.
A structural alphabet at the atomic level may be seen as an alphabet featuring, for an atom of a given type, all the conformations this atom may engage into, depending on its neighbors. One way to tackle this problem consists of extending the notions of molecular surfaces used so far, so as to encode multi-body relations between an atom and its neighbors. In order to derive such alphabets, the following two strategies are natural. On one hand, one may use an encoding of neighborhoods based on geometric constructions such as Voronoï diagrams (affine or curved) or arrangements of balls. On the other hand, one may resort to clustering strategies in higher dimensional spaces, as the $p$ neighbors of a given atom are represented by $3p-6$ degrees of freedom, the neighborhood being invariant upon rigid motions.
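To make the invariance requirement of the clustering strategy concrete, here is a minimal Python sketch (assuming NumPy) that maps a neighborhood to a vector unchanged by rigid motions, namely its sorted pairwise distances. This descriptor is an assumption made for illustration: it is also invariant under reflections and relabeling of the neighbors, it is redundant with respect to the degrees of freedom, and it is not the encoding proposed in the text. The name `invariant_descriptor` is hypothetical.

```python
import numpy as np

def invariant_descriptor(neighbors):
    """Sorted pairwise distances of a set of 3D points.

    The result has p(p-1)/2 entries for p points and does not change when
    the points are rotated, translated, reflected or reordered.
    """
    pts = np.asarray(neighbors, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(pts), k=1)
    return np.sort(dist[upper])

# Toy usage: a neighborhood and a rigidly moved copy share the same descriptor.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
moved = pts @ rot.T + np.array([5.0, -2.0, 1.0])
same = np.allclose(invariant_descriptor(pts), invariant_descriptor(moved))
```

Such vectors could then be handed to any standard clustering method; which one is appropriate is left open here.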
In the dynamic setting, one wishes to understand whether selected (hot) residues exhibit specific dynamic properties, so as to serve as anchors in a binding process. More generally, any significant observation raised in the static setting deserves investigations in the dynamic setting, so as to assess its stability. Such questions are also related to the problem of correlated motions, which we discuss next.
Problems addressed. Proteins in vivo vibrate at various frequencies: high frequencies correspond to small amplitude deformations of chemical bonds, while low frequencies characterize more global deformations. This flexibility contributes to the entropy, and thus to the free energy, of the protein-solvent system. From the experimental standpoint, NMR studies and Molecular Dynamics simulations generate ensembles of conformations, called conformers. Of particular interest while investigating flexibility is the notion of correlated motion. Intuitively, when a protein is folded, all atomic movements must be correlated, a constraint which gets alleviated when the protein unfolds since the steric constraints get relaxed. Parameterizing these correlated motions, describing the corresponding energy landscapes, as well as handling collections of conformations pose challenging algorithmic problems.
Methodological developments. At the side-chain level, the question of improving rotamer libraries is still of interest. This question is essentially a clustering problem in the parameter space describing the side-chain conformations.
At the atomic level, flexibility is essentially investigated using methods based on a classical potential energy (molecular dynamics), and on (inverse) kinematics. A molecular dynamics simulation provides a point cloud sampling the conformational landscape of the molecular system investigated, as each step in the simulation corresponds to one point in the parameter space describing the system (the conformational space). The standard methodology to analyze such a point cloud consists of resorting to normal modes. Recently, though, more elaborate methods resorting to more local analysis, to Morse theory, and to the analysis of meta-stable states of time series have been proposed.
Given a sampling on an energy landscape, a number of fundamental issues actually arise: how does the point cloud describe the topography of the energy landscape (a question reminiscent of Morse theory)? Can one infer the effective number of degrees of freedom of the system over the simulation, and is this number varying? Answers to these questions would be of major interest to refine our understanding of folding and docking, with applications to the prediction of structural properties. It should be noted in passing that such questions are probably related to modeling phase transitions in statistical physics, where geometric and topological methods are being used.
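As a crude, purely linear illustration of the question on the effective number of degrees of freedom, the following Python sketch (assuming NumPy) counts how many principal components of a point cloud are needed to explain a given fraction of its variance. Principal component analysis is swapped in here as a simple stand-in; it is not the normal mode analysis nor any of the more elaborate methods mentioned above, and the name `effective_dimension` and the 95% default threshold are hypothetical choices.

```python
import numpy as np

def effective_dimension(conformations, variance_fraction=0.95):
    """Number of principal components needed to reach `variance_fraction`.

    `conformations` is an (n_samples, n_coordinates) array, one row per
    simulation step.
    """
    X = np.asarray(conformations, dtype=float)
    X = X - X.mean(axis=0)                      # center the point cloud
    singular_values = np.linalg.svd(X, compute_uv=False)
    variances = singular_values ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    # first index whose cumulative explained variance reaches the threshold
    return int(np.searchsorted(cumulative, variance_fraction) + 1)

# Toy usage: 200 samples lying on a 2D plane embedded in a 10D space.
rng = np.random.default_rng(0)
coords = rng.normal(size=(200, 2))
basis = np.zeros((2, 10))
basis[0, 0] = 1.0
basis[1, 3] = 1.0
cloud = coords @ basis
dim = effective_dimension(cloud, variance_fraction=0.99)
```

Applying the function to sliding windows of a trajectory would give a rough indication of whether this number varies over the simulation; a curved landscape would in general call for non-linear tools.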
From an algorithmic standpoint, such questions are reminiscent of shape learning. Given a collection of samples on an (unknown) model, learning consists of guessing the model from the samples; the result of this process may be called the reconstruction. In doing so, two types of guarantees are sought: topologically speaking, the reconstruction and the model should (ideally!) be isotopic; geometrically speaking, their Hausdorff distance should be small. Motivated by applications in CAGD, surface reconstruction triggered a major activity in the Computational Geometry community over the past five years. Aside from applications, reconstruction raises a number of deep issues: the study of distance functions to the model and to the samples, and their comparison; the study of Morse-like constructions stemming from distance functions to points; the analysis of topological invariants of the model and the samples, and their comparison.
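The geometric guarantee mentioned above relies on the Hausdorff distance. For two finite point sets it can be computed by brute force, as in this minimal Python sketch (assuming NumPy; the name `hausdorff_distance` is chosen for illustration, and the quadratic memory cost makes it suitable for small inputs only).

```python
import numpy as np

def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between two finite point sets.

    It is the largest distance from a point of one set to its nearest
    point in the other set, taken over both directions.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    a_to_b = d.min(axis=1).max()   # farthest point of A from B
    b_to_a = d.min(axis=0).max()   # farthest point of B from A
    return float(max(a_to_b, b_to_a))

# Toy usage: B contains A plus one outlier at distance 2 from A.
A = [[0.0, 0.0], [1.0, 0.0]]
B = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
dist = hausdorff_distance(A, B)
```

The example shows why the distance is taken in both directions: every point of A lies in B, yet the outlier of B is far from A.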
Last but not least, gaining insight on such questions would also help to effectively select a reduced set of conformations best representing a larger number of conformations. This selection problem is indeed faced by flexible docking algorithms that need to maintain and/or update collections of conformers for the second stage of the diffusion - conformer selection - induced fit complex formation model.
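One simple way to phrase this selection problem is as a k-center problem: pick k conformers so that every conformer is close to a selected one. The Python sketch below (assuming NumPy) uses the classical greedy farthest-point heuristic. It is an illustration under stated assumptions, not the optimization algorithm developed by the team: conformers are taken to be coordinate vectors compared with the Euclidean distance, whereas a practical setting would rather use an RMSD after superposition, and the name `select_representatives` is hypothetical.

```python
import numpy as np

def select_representatives(conformers, k):
    """Greedy farthest-point selection of k representatives (k <= n).

    Returns the indices of the selected conformers and the covering radius,
    i.e. the largest distance from any conformer to its nearest representative.
    """
    X = np.asarray(conformers, dtype=float)
    chosen = [0]                                    # arbitrary starting point
    dist = np.linalg.norm(X - X[0], axis=1)         # distance to nearest chosen
    while len(chosen) < k:
        nxt = int(np.argmax(dist))                  # farthest from current selection
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen, float(dist.max())

# Toy usage: five one-dimensional "conformers" forming three groups.
data = [[0.0], [0.1], [10.0], [10.1], [5.0]]
indices, radius = select_representatives(data, k=3)
```

The greedy heuristic is known to yield a covering radius at most twice the optimal one for the k-center objective, which makes it a reasonable baseline before turning to more elaborate optimization.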
We recently proposed a model of (macro-)molecular interfaces based upon power diagrams. The corresponding software, Intervor, has been made available to the community from the web site
http://
In collaboration with L. Rineau and S. Pion, Geometrica. Work started by Nicolas Carrez, summer intern, 2005.
http://
CGAL is a C++ library of geometric algorithms initially developed within two European projects (project ESPRIT IV LTR CGAL, December 97 - June 98; project ESPRIT IV LTR GALIA, November 99 - August 00) by a consortium of eight research teams from the following institutes: Universiteit Utrecht, Max-Planck Institut Saarbrücken, INRIA Sophia Antipolis, ETH Zürich, Tel Aviv University, Freie Universität Berlin, Universität Halle, RISC Linz. The goal of CGAL is to make the solutions offered by the computational geometry community available to the industrial world and applied domains.
The IPE editor, see
http://
Based on the 2D algorithms present in the CGAL library, we developed in C++ a set of plugins, so as to make the following algorithms available from IPE: triangulations (Delaunay, constrained Delaunay, regular) as well as their duals, a convex hull algorithm, polygon partitioning algorithms, polygon offset, and arrangements of linear and degree two primitives. The first version was released on 08/13/2007. These plugins are available under the Open Source LGPL license, and are subject to the constraints of the underlying CGAL packages. They can be downloaded from
http://
Given a collection of circles on a sphere, we adapt the Bentley-Ottmann algorithm to the spherical setting to compute the exact arrangement of the circles. The algorithm consists of sweeping the sphere with a meridian, which is non-trivial because of the degenerate cases and the algebraic specification of event points.
From an algorithmic perspective, and with respect to general sweep-line algorithms, we investigate a strategy maintaining a linear size event queue. (The algebraic aspects involved in the development of the predicates used by our algorithm are reported separately.)
From an implementation perspective, we present the first effective arrangement calculation dealing with general circles on a sphere in an exact fashion, as exactness incurs a mere factor of two with respect to calculations performed using double precision floating point numbers on generic examples. In particular, we stress the importance of maintaining a linear size queue, in conjunction with arithmetic filter failures.
From an application perspective, we present an application in structural biology. Given a collection of atomic balls, we adapt the sweep-line algorithm to report all balls covering a given face of the spherical arrangement on a given atom. This calculation is used to define molecular surface related quantities going beyond the classical exposed and buried solvent accessible surface areas. Spectacular differences w.r.t. traditional observations on protein - protein and protein - drug complexes are also reported.
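The vertices of such an arrangement are intersection points between pairs of circles. As a rough numerical illustration only, the Python sketch below (assuming NumPy) computes these points for two circles on the unit sphere with ordinary floating point arithmetic. It is not the exact, algebraic computation described above and does not handle degeneracies robustly; each circle is assumed to be given by a unit direction `n` and an angular radius `a`, so that the circle is the set of unit vectors `x` with `n . x = cos(a)`. The name `circle_intersections` and the tolerance `eps` are hypothetical.

```python
import numpy as np

def circle_intersections(n1, a1, n2, a2, eps=1e-12):
    """Intersection points of two circles drawn on the unit sphere.

    Returns a list of 0, 1 (tangency) or 2 points, computed in floating point.
    """
    n1 = np.asarray(n1, dtype=float) / np.linalg.norm(n1)
    n2 = np.asarray(n2, dtype=float) / np.linalg.norm(n2)
    c = float(n1 @ n2)
    denom = 1.0 - c * c
    if denom < eps:
        return []                       # parallel supporting planes: no isolated points
    d1, d2 = np.cos(a1), np.cos(a2)
    # point of the line shared by the two supporting planes, closest to the origin
    alpha = (d1 - c * d2) / denom
    beta = (d2 - c * d1) / denom
    base = alpha * n1 + beta * n2
    h2 = 1.0 - float(base @ base)       # squared half-length of the chord
    if h2 < -eps:
        return []                       # the line misses the sphere
    if h2 <= eps:
        return [base]                   # tangent circles
    axis = np.cross(n1, n2) / np.sqrt(denom)
    h = np.sqrt(h2)
    return [base + h * axis, base - h * axis]

# Toy usage: the great circles z = 0 and x = 0 meet at (0, 1, 0) and (0, -1, 0).
points = circle_intersections([0, 0, 1], np.pi / 2, [1, 0, 0], np.pi / 2)
```

An exact implementation would instead represent these points with algebraic numbers, which is where the predicates and filters discussed in the text come into play.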
In collaboration with P. Machado Manhães de Castro and M. Teillaud, Geometrica. The first part of this work was partially done while Pedro Machado Manhães de Castro was visiting Geometrica in 2006, in the framework of the INRIA Internship Programme.
A 3D kernel for the manipulation of spheres, circles and circular arcs in 3D was submitted to the CGAL Editorial Board. The package follows the same overall design as the 2D circular kernel, and offers functionalities involving these objects. It also defines the concept of an algebraic kernel dedicated to the special case of spheres, lines and circles in 3D.
We proposed more recently to expand this package by adding objects and functionalities dedicated to the case where all the objects handled are located on a reference sphere. We showed how the two frameworks can be combined.
In collaboration with M. Pouget (VEGAS project, LORIA, NANCY).
Surfaces are ubiquitous in science and engineering, and estimating the local differential properties of a surface discretized as a point cloud or a triangle mesh is a central building block in all situations where surfaces are dealt with. One strategy to perform such an estimation consists of resorting to polynomial fitting, either interpolation or approximation, but this route is difficult for several reasons: choice of the coordinate system, numerical handling of the fitting problem, extraction of the differential properties.
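The polynomial fitting route can be illustrated by a minimal Python sketch (assuming NumPy). It fits a degree two height function $z = a x^2 + b x y + c y^2 + d x + e y + f$ by least squares and reads the principal curvatures at $(x, y) = (0, 0)$ from the first and second fundamental forms of the fitted graph. The sketch assumes that the points are already expressed in a coordinate system whose z axis is close to the surface normal, which sidesteps the choice of coordinate system mentioned above; the degree, the solver and the name `principal_curvatures` are illustrative choices, not the method developed in this work.

```python
import numpy as np

def principal_curvatures(points):
    """Principal curvatures at (x, y) = (0, 0) of a least-squares quadric fit.

    `points` is an (n, 3) array of samples (x, y, z) with n >= 6.
    Curvatures are returned in increasing order, for the upward normal.
    """
    P = np.asarray(points, dtype=float)
    x, y, z = P[:, 0], P[:, 1], P[:, 2]
    M = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    a, b, c, d, e, _ = np.linalg.lstsq(M, z, rcond=None)[0]
    # fundamental forms of the graph z = h(x, y) at the origin
    first = np.array([[1 + d * d, d * e], [d * e, 1 + e * e]])
    second = np.array([[2 * a, b], [b, 2 * c]]) / np.sqrt(1 + d * d + e * e)
    shape_operator = np.linalg.solve(first, second)
    return np.sort(np.linalg.eigvals(shape_operator).real)

# Toy usage: samples of the paraboloid z = 0.5 x^2 + y^2 around the origin.
grid = np.linspace(-0.2, 0.2, 5)
xx, yy = np.meshgrid(grid, grid)
samples = np.column_stack([xx.ravel(), yy.ravel(), 0.5 * xx.ravel() ** 2 + yy.ravel() ** 2])
k_min, k_max = principal_curvatures(samples)
```

On noisy data or with a poorly chosen frame, the conditioning of the least-squares system degrades, which is one of the numerical difficulties referred to in the text.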
The goal of this ARC is to develop new methodologies to handle the flexibility of proteins. The ARC, which is coordinated by F. Cazals, features a collaboration with Institut Pasteur Paris
(M. Nilges), and MPI Saarbrücken (J. Giesen). The following topics have been investigated: characterization of dynamic properties of interface atoms during Molecular Dynamics simulations of
a complex; optimization algorithms to perform conformer selection amongst a molecular ensemble; characterization of induced fit mechanisms in protein-drug binding. Additional details can be
found on the following web site
http://
The France-Stanford Center for Interdisciplinary Studies is funding a two-year project (2007-08) entitled Developments of Geometric Methods and Algorithms for the study of macro molecular assemblies. The PIs are F. Cazals (INRIA) and M. Levitt (chair of the Structural Biology Dpt, Stanford University). The goal of the project is to make a stride towards improved multi-scale modeling of large protein complexes.
- Frédéric Cazals was a member of the paper committee of the Eurographics Symposium on Geometry Processing 07, of the ACM Symposium on Solid and Physical Modeling 07, and of New Advances in Shape Analysis and Geometric Modeling 07.
- Frédéric Cazals is a member of the INRIA Scientific Steering Committee (COST).
The ABS web server is under construction; its release is scheduled for December 2007.
- Ecole Centrale Paris, 3rd year (master); Introduction to Computational Structural Biology; F. Cazals, 21h.
- Master Bioinformatique et Biostatistiques (BIBS), Orsay University; Algorithmic Problems in Computational Structural Biology; F. Cazals (12h), J. Janin (6h), C. Robert (6h).
- Cursus Ecole Normale Supérieure de Lyon à Sophia-Antipolis; Topics in Structural Biology; F. Cazals, 24h.
Internship proposals can be seen on the web at
http://
- Sushant Sachdeva, Optimization problems for conformer selection, IIT Bombay.
- Sébastien Loriot, Modélisation mathématique, calcul et classification de poches d'arrimage de médicaments sur les protéines, université de Bourgogne.
Members of the project have presented their published articles at conferences. The reader can refer to the bibliography to obtain the corresponding list. We list below all other talks given in seminars or summer schools.
- Modèles et algorithmes pour la description des interactions macro-moléculaires: le triptyque biophysique - géométrie - statistiques; French Academy of Sciences, Colloquium Avancées en Sciences de l'Information présentées par leurs auteurs; F. Cazals.
- Geometric and topological inference in the non linear realm: on the importance of singularity theory; Non-Linear Computational Geometry, Institute for Mathematics and its Applications, Univ. of Minnesota; F. Cazals.
- Modèles pour la description de la structure des protéines: de la géométrie à la bio-physique en passant par les statistiques; Bioinformatique, modélisation des systèmes biologiques Journées 2007, ACI-IMPBIO & GDR BIM, Institut Henri Poincaré (Paris); F. Cazals.
- Describing protein-protein and atomic environments: a geometric perspective; Dpts of Biopharmaceutical Sciences and Pharmaceutical Chemistry (SALILAB), UCSF; Dpt of Structural Biology, Stanford Univ., F. Cazals.
http://
Joint seminar with Geometrica.
- F. Cazals visited Stanford University for two weeks (November).
ABS has hosted the following scientists:
- Raik Gruenberg, Univ. of Barcelona, one week, July 2007.