Computational Structural Biology (CSB) is the scientific domain
concerned with the development of algorithms and software to
understand and predict the structure and function of biological
macromolecules.
This research field is inherently multi-disciplinary.
On the experimental side, biology and medicine provide the objects
studied, while biophysics and bioinformatics supply experimental data, which are of two
main kinds. On the one hand, genome sequencing projects supply
protein sequences, and ~200 million sequences have been archived in UniProtKB/TrEMBL, which collects the protein sequences yielded
by genome sequencing projects. On the other hand, structure
determination experiments (notably X-ray crystallography, nuclear magnetic resonance, and
cryo-electron microscopy) give access to geometric models of molecules
– atomic coordinates.
Alas, only ~150,000 structures have been solved and deposited in
the Protein Data Bank (PDB), a number to be compared against the
~200 million sequences of UniProtKB/TrEMBL. With roughly one
structure per ~1000 sequences, we know very little about
biological functions at the atomic/structural level.
Complementing experiments, physical chemistry/chemical physics supply
the required models (energies, thermodynamics, etc.). More
specifically, let us recall the lock-and-key metaphor
for interacting molecules: Biology is based on the interactions stable
conformations make with each other. Turning these intuitive notions
into quantitative ones requires delving into statistical physics, as
macroscopic properties are average properties computed over ensembles
of conformations.
Developing effective algorithms to perform accurate simulations is
especially challenging for two main reasons. The first one is the high
dimension of conformational spaces – see tour
de force rarely achieved 38.
The first challenge, sequence-to-structure prediction, aims to
infer the possible structure(s) of a protein from its amino acid
sequence. While progress has recently been made,
in particular using deep learning techniques
37, the models obtained so far are
static and coarse-grained.
The second one is protein function prediction. Given a protein
with known structure i.e. 3D coordinates, the goal is to predict the
partners of this protein, in terms of stability and
specificity. This understanding is fundamental to biology and
medicine, as illustrated by the example of the SARS-CoV-2 virus
responsible for the COVID-19 pandemic. To infect a host, the virus first
fuses its envelope with the membrane of a target cell, and then
injects its genetic material into that cell. Fusion is achieved by a
so-called class I fusion protein, also found in other viruses
(influenza, SARS-CoV-1, HIV, etc). The fusion process is a highly
dynamic process involving large amplitude conformational changes of
the molecules. It is poorly understood, which hinders our ability to
design therapeutics to block it.
Finally, the third one, large assembly reconstruction,
aims at solving (coarse-grain) structures of molecular machines involving
tens or even hundreds of subunits. This line of research was promoted
about 15 years ago by the work on the nuclear pore complex
26. It is often referred to as reconstruction by
data integration, as it requires combining coarse-grain models
(notably from cryo-electron microscopy (cryo-EM) and native mass
spectrometry) with atomic models of subunits obtained from X-ray
crystallography.
Fitting the latter into the former requires exploring
the conformation space of subunits, whence the importance of protein
dynamics.
As an illustration of these three challenges, consider
the problem of designing proteins blocking the entry of SARS-CoV-2 into our cells
(Fig. 1).
The first challenge is illustrated by the problem of predicting the
structure of a blocker protein from its sequence of amino acids, a tractable
problem here since the mini proteins used comprise only on the order
of 50 amino acids (Fig. 1(A), 29).
The second challenge is illustrated by the calculation of the binding
modes and the binding affinity of the designed proteins for the RBD of
SARS-CoV-2 (Fig. 1(B)).
Finally, the last challenge
is illustrated by the problem of solving structures of the virus
with a cell, to understand how many spikes are involved in the fusion
mechanism leading to infection.
In 29, the promising designs suggested by modeling
have been assessed by an array of wet lab experiments (affinity
measurements, circular dichroism for thermal stability assessment,
structure resolution by cryo-EM).
The hyperstable minibinders identified provide starting points
for SARS-CoV-2 therapeutics 29.
We note in passing that this is truly remarkable work; yet the
designed proteins stem from a template (the bottom helix from
ACE2) and are rather small.
To present challenges in structural modeling, let us recall the following ingredients.
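First, a molecular model is parameterized by its degrees of freedom (d.o.f.).
Second, recall that the potential energy landscape (PEL) is the
mapping associating a potential energy (PE) with each conformation; example PE are those of CHARMM, AMBER, MARTINI, etc. Such PE belong to the realm of molecular mechanics,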
and implement atomic or coarse-grain models. They may embark a solvent
model, either explicit or implicit. Their definition requires a
significant number of parameters (up to
These PE are usually considered good enough to study non-covalent interactions (our focus), even though they do not cover the modification of chemical bonds. In any case, we take such a function for granted 1.
The PEL codes all structural, thermodynamic, and kinetic properties, which can be obtained by averaging
properties of
conformations over so-called thermodynamic ensembles.
The structure of a macromolecular system requires the
characterization of active conformations and important intermediates
in functional pathways involving significant basins.
In assigning occupation probabilities to these conformations by
integrating Boltzmann's distribution, one treats thermodynamics.
Finally, transitions between the states,
modeled, say, by a master equation (a continuous-time Markov process),
correspond to kinetics.
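As a minimal sketch of the last two notions, assume the PEL has already been reduced to a handful of discrete states (basins) with known energies and saddle energies. The toy energies, the Arrhenius-like rate model with unit prefactor, and all function names below are assumptions made for illustration only. Occupation probabilities follow from Boltzmann weights, and a master equation dp/dt = K p relaxes any initial distribution towards them.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol.K)


def boltzmann_populations(energies, temperature):
    """Occupation probabilities p_i proportional to exp(-E_i / kT)."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def rate_matrix(energies, saddles, temperature, prefactor=1.0):
    """Rate matrix K of the master equation dp/dt = K p.

    saddles maps a pair of states (i, j) to the energy of the saddle
    connecting them. K[i][j] is the rate of the transition j -> i
    (hypothetical Arrhenius-like model, chosen for illustration).
    """
    beta = 1.0 / (K_B * temperature)
    n = len(energies)
    k = [[0.0] * n for _ in range(n)]
    for (i, j), e_saddle in saddles.items():
        k[i][j] = prefactor * math.exp(-beta * (e_saddle - energies[j]))
        k[j][i] = prefactor * math.exp(-beta * (e_saddle - energies[i]))
    for j in range(n):
        # Each column sums to zero, so total probability is conserved.
        k[j][j] = -sum(k[i][j] for i in range(n) if i != j)
    return k


def propagate(k, p0, dt, n_steps):
    """Explicit Euler integration of the master equation."""
    n = len(p0)
    p = list(p0)
    for _ in range(n_steps):
        dp = [sum(k[i][j] * p[j] for j in range(n)) for i in range(n)]
        p = [p[i] + dt * dp[i] for i in range(n)]
    return p
```

With these rates, detailed balance holds, so the long-time limit of `propagate` coincides with `boltzmann_populations`. The relaxation time is set by the slowest non-zero eigenvalue of K, which is the kinetic information that the populations alone do not carry.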
Classical simulation methods based on molecular dynamics (MD)
and Monte Carlo sampling (MC) are developed in the lineage of the
seminal work by the 2013 recipients of the Nobel prize in chemistry
(Karplus, Levitt, Warshel), which was awarded “for the
development of multiscale models for complex chemical systems”.
However, except for highly specialized cases where massive
calculations have been used 38, neither MD nor MC
give access to the aforementioned time scales. In fact, the main
limitation of such methods is that they treat structural,
thermodynamic and kinetic aspects at once
32. The absence of specific
insights on these three complementary pieces of the puzzle makes it
impossible to optimize simulation methods, and results in general in
the inability to obtain converged simulations on biologically relevant
time-scales.
The hardness of structural modeling stems from three intertwined reasons.
First, PELs of biomolecules usually exhibit a number of critical
points exponential in the dimension 27; fortunately, they
enjoy a multi-scale structure 30.
Intuitively, the significant local minima/basins are those which are
deep or isolated/wide, two notions which are mathematically
qualified by the concepts of persistence and prominence.
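To make the notion of persistence concrete, here is a small sketch for a function sampled on a one-dimensional grid, a deliberately simplified stand-in for a PEL. It sweeps the samples by increasing value and maintains connected components of the sublevel set with a union-find structure. When two basins merge, the one with the higher minimum dies, and its persistence is the height of the merging sample above that minimum. The function name and the toy profile are illustrative assumptions.

```python
import math


def minima_persistence(values):
    """Persistence of the local minima of a sampled 1D function.

    Returns a list of (index_of_minimum, persistence) pairs, computed from
    the sublevel-set filtration. The global minimum never dies and gets
    infinite persistence.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    parent = [None] * n  # None means "sample not inserted yet"
    birth = {}           # component root -> index of its lowest sample

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        birth[i] = i
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component with the higher minimum dies.
                if values[birth[ri]] > values[birth[rj]]:
                    ri, rj = rj, ri
                dying = birth[rj]
                if dying != i:  # skip the trivial pair created by i itself
                    pairs.append((dying, values[i] - values[dying]))
                parent[rj] = ri
    pairs.append((order[0], math.inf))
    return pairs


profile = [3, 1, 2, 0, 4, 2.5, 5]
print(sorted(minima_persistence(profile)))
```

On this profile, the shallow minimum at index 1 has persistence 1, while the minimum at index 5 has persistence 1.5. Thresholding on persistence is one way to retain only the significant basins.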
Mathematically, problems are plagued with the curse of dimensionality and measure concentration phenomena.
Second, biomolecular processes are inherently multi-scale, with motions spanning
a wide range of length and time scales. Third, macroscopic properties, i.e. observables, are
average properties computed over ensembles of conformations, which calls
for a multi-scale statistical treatment both of thermodynamics and kinetics.
A critical question concerns the validation of models proposed in structural bioinformatics. For all three types of questions of interest (structures, thermodynamics, kinetics), there exist experiments against which the models must be checked, when the experiments can be conducted.
For structures, the models proposed can readily be compared against
experimental results stemming from X-ray crystallography, NMR, or
cryo-electron microscopy. For thermodynamics, which we illustrate here
with binding affinities, predictions can be compared against
measurements provided by calorimetry or surface plasmon resonance.
Lastly, kinetic predictions can also be assessed by various experiments
such as binding affinity measurements (for the prediction of
Our research program aims to develop a comprehensive set of novel concepts and algorithms to study protein dynamics, based on the modular framework of PEL.
As noted while discussing Protein dynamics: core CS -
maths challenges, the integrated nature of simulation methods such
as MD or MC is such that these methods do not in general give access to
biologically relevant time scales.
The framework of energy landscapes
39, 36 (Fig. 2) is
much more modular, yet, large biomolecular systems remain out of
reach.
To make a definitive step towards solving the prediction of protein
dynamics, we will serialize the discovery and the exploitation of a
PEL 4, 13, 3.
Ideas and concepts from computational geometry/geometric motion
planning, machine learning, probabilistic algorithms, and numerical
probability will be used to develop two classes of probabilistic
algorithms.
The first deals with algorithms to discover/sketch PELs i.e. enumerate
all significant (persistent or prominent) local minima and their
connections across saddles, a difficult task since the number of all
local minima/critical points is generally exponential in the
dimension. To this end, we will develop a hierarchical data structure
coding PELs as well as multi-scale proposals to explore molecular
conformations. (Nb: in Monte Carlo methods, a proposal generates a new
conformation from an existing one.)
The second focuses on methods to exploit/sample PELs i.e. compute
so-called densities of states, from which all thermodynamic quantities
are given by standard
relations 28, 35. This is a
hard problem akin to high-dimensional numerical integration. To solve
this problem, we will develop a learning-based strategy for the
Wang-Landau algorithm 34, an adaptive Markov chain
Monte Carlo (MCMC) algorithm, as well as a generalization of
multi-phase Monte Carlo methods for convex/polytope volume
calculations 33, 31 to non-convex
strata of PELs.
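For readers unfamiliar with the Wang-Landau algorithm, the following sketch shows its classical form on a toy system, not the learning-based variant envisioned above. The system is an assumption chosen because its density of states is known exactly: n independent two-state spins, with energy equal to the number of spins in state 1, so that g(E) is a binomial coefficient. All parameter names and default values are illustrative.

```python
import math
import random


def wang_landau(n_spins=8, ln_f_final=1e-4, flatness=0.8,
                check_every=2000, max_steps=2_000_000, seed=0):
    """Estimate ln g(E) for n independent two-state spins.

    Returns the estimates shifted so that ln g(0) = 0.
    """
    rng = random.Random(seed)
    state = [0] * n_spins
    energy = 0
    ln_g = [0.0] * (n_spins + 1)   # running estimate of ln g(E)
    hist = [0] * (n_spins + 1)     # visit histogram of the current stage
    ln_f = 1.0                     # modification factor, halved per stage
    steps = 0
    while ln_f > ln_f_final and steps < max_steps:
        for _ in range(check_every):
            i = rng.randrange(n_spins)             # proposal: flip one spin
            new_energy = energy + (1 - 2 * state[i])
            delta = ln_g[energy] - ln_g[new_energy]
            # Accept with probability min(1, g(E) / g(E')).
            if delta >= 0 or rng.random() < math.exp(delta):
                state[i] = 1 - state[i]
                energy = new_energy
            ln_g[energy] += ln_f                   # penalize the visited level
            hist[energy] += 1
        steps += check_every
        # Stage ends when the histogram is flat enough.
        if min(hist) >= flatness * sum(hist) / len(hist):
            ln_f /= 2.0
            hist = [0] * (n_spins + 1)
    return [x - ln_g[0] for x in ln_g]
```

Calling `wang_landau()` and comparing the result with `math.log(math.comb(8, e))` for each level `e` gives a quick sanity check. Once ln g(E) is available, the partition function at any temperature is a one-dimensional sum over energy levels, which is why densities of states give access to all thermodynamic quantities.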
As discussed in the previous Section, the study of PEL and protein dynamics raises difficult algorithmic / mathematical questions. As an illustration, one may consider our recent work on the comparison of high dimensional distributions 6, statistical tests / two-sample tests 7, 10, the comparison of clusterings 8, the complexity study of graph inference problems for low-resolution reconstruction of assemblies 9, the analysis of partition (or clustering) stability in large networks, and the complexity of the representation of simplicial complexes 2. Making progress on such questions is fundamental to advance the state of the art on protein dynamics.
We will continue to work on such questions, motivated by CSB / theoretical biophysics, both in the continuous (geometric) and discrete settings. The developments will be based on a combination of ideas and concepts from computational geometry, machine learning (notably on non linear dimensionality reduction, the reconstruction of cell complexes, and sampling methods), graph algorithms, probabilistic algorithms, optimization, numerical probability, and also biophysics.
While our main ambition is to advance the algorithmic foundations
of molecular simulation, a major challenge will be to ensure that the
theoretical and algorithmic developments will change the fate of
applications, as illustrated by our case studies.
To foster such a symbiotic relationship between theory, algorithms and
simulation, we will pursue high quality software development and
integration within the SBL, and will also take the appropriate
measures for the software to be widely adopted.
Software development for structural bioinformatics is especially
challenging, combining advanced geometric, numerical and combinatorial
algorithms, with complex biophysical models for PEL and related
thermodynamic/kinetic properties. Specific features of the proteins
studied must also be accommodated.
About 50 years after the development of force fields and simulation
methods (see the 2013 Nobel prize in chemistry), the software implementing
such methods has a profound impact on molecular science at large.
One can indeed cite packages such as
CHARMM,
AMBER,
GROMACS,
GMIN,
MODELLER,
Rosetta,
VMD,
PyMOL, ....
On the other hand, these packages are goal oriented, each tackling a
(small set of) specific goal(s). In fact, no real modular software
design and integration has taken place.
As a result, despite the high quality software packages available,
inter-operability between algorithmic building blocks has remained
very limited.
Predicting the dynamics of large molecular systems
requires the integration of advanced algorithmic building blocks /
complex software components.
To achieve a sufficient level of integration, we undertook the
development of the Structural Bioinformatics Library (SBL,
SB) 5, a generic
C++/python cross-platform library providing software to solve complex
problems in structural bioinformatics.
For end-users, the SBL provides ready to use, state-of-the-art
applications to model macro-molecules and their complexes at various
resolutions, and also to store results in perennial and easy to use data
formats (SBL Applications).
For developers, the SBL provides a broad C++/python toolbox with
modular design (SBL Doc).
This hybrid status targeting both end-users and developers stems from
an advanced software design involving four software components, namely
applications, core algorithms, biophysical models, and modules
(SBL Modules).
This modular design makes it possible to optimize robustness and the
performance of individual components, which can then be assembled
within a goal oriented application.
Our methods will be validated on various systems for which flexibility operates at various scales. Examples of such systems are antibody-antigen complexes, (viral) polymerases, and (membrane) transporters.
Even very complex biomolecular systems are deterministic in prescribed
conditions (temperature, pH, etc.), demonstrating that despite their
high dimensionality, not all d.o.f. are at play at the same time. This
insight suggests three classes of systems of particular interest.
The first class consists of systems defined from (essentially) rigid blocks
whose relative positions change thanks to conformational changes of
linkers; a Newton cradle provides an interesting way to envision such
a system. We have recently worked on one such system, a membrane
protein involved in antibiotic resistance (AcrB, see
14).
The second class consists of cases where relative positions of subdomains do
not significantly change, yet, their intrinsic dynamics are
significantly altered. A classical illustration is provided by
antibodies, whose binding affinity stems from dynamics localized in six
specific loops 11, 12.
The third class, consisting of composite cases, will greatly benefit
from insights on the first two classes. As an example, we may
consider the spikes of the SARS-CoV-2 virus, whose function (performing
infection) involves both large amplitude conformational changes and
subtle dynamics of the so-called receptor binding domain. We have
started to investigate this system, in collaboration with B. Delmas
(INRAe) 15.
In ABS, we will investigate systems in these three tiers, in collaboration with expert collaborators, to hopefully open new perspectives in biology and medicine. Along the way, we will also collaborate on selected questions at the interface between CSB and systems biology, as it is now clear that the structural level and the systems level (pathways of interacting molecules) can benefit from one another.
The main application domain is Computational Structural Biology, as
underlined in the Research Program.
In October 2021, Edoardo Sarti joined ABS as Chargé de
Recherche de Classe Normale. His expertise covers a diverse set
of interests, ranging from algorithmic questions about geometrical,
functional and evolutionary aspects of biomolecules (latest
study: 23) to the collection and analysis
of large sets of molecular structural data. From the very
start, E. Sarti has taken part in several research and
technical projects of ABS.
See report on the Structural Bioinformatics Library.
The SBL is a generic C++/python cross-platform software library targeting complex problems in structural bioinformatics. Its tenet is a modular design offering a rich and versatile framework, which allows the development of novel applications requiring well specified complex operations without compromising robustness and performance.
More specifically, the SBL involves four software components (1-4 hereafter). For end-users, the SBL provides ready to use, state-of-the-art (1) applications to handle molecular models defined by unions of balls, to deal with molecular flexibility, and to model macro-molecular assemblies. These applications can also be combined to tackle integrated analysis problems. For developers, the SBL provides a broad C++ toolbox with modular design, involving core (2) algorithms, (3) biophysical models, and (4) modules, the latter being especially suited to develop novel applications. The SBL comes with thorough documentation consisting of user and reference manuals, and a Bugzilla platform to handle community feedback.
In 2021, two new packages were released. The first one, Fréchet mean on the unit circle ( https://sbl.inria.fr/doc/Frechet_mean_S1-user-manual.html ), provides the first algorithm to compute the exact center of mass of angular data, a key step in computing rotamers for example. The second one, tripeptide loop closure ( https://sbl.inria.fr/doc/Tripeptide_loop_closure-user-manual.html ), is concerned with the reconstruction of all backbone geometries for a tripeptide when the 6 dihedral angles associated with the three Cα carbons are free to move, all other internal coordinates being fixed. This algorithm is a cornerstone of move sets in internal coordinates.
The SBL has also started a new process of innovation and enhanced distribution. On the one hand, a pre-existing conda package has been revamped, thus extending the range of OS supporting ready-to-use SBL functions to MacOS and Windows. On the other hand, the library has been compiled in a user-ready Singularity container, and can now be used on any Linux system, including environments where the user does not have administrator/root privileges, e.g. clusters and distributed systems for scientific computing.
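To illustrate what a center of mass of angular data means, the sketch below computes an unweighted Fréchet mean on the unit circle for squared geodesic distances. It is not the SBL algorithm. It relies on a classical observation: away from the antipodes of the data points, the derivative of the Fréchet function vanishes only at the arithmetic mean of the angles shifted by multiples of 2π/n, so it suffices to evaluate these n candidates. This naive version is quadratic in n, and the function names are illustrative.

```python
import math

TWO_PI = 2.0 * math.pi


def circ_dist(a, b):
    """Geodesic (arc length) distance between two angles in radians."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def frechet_function(x, angles):
    """Sum of squared geodesic distances from x to the data."""
    return sum(circ_dist(x, a) ** 2 for a in angles)


def frechet_mean_s1(angles):
    """Global minimizer of the Frechet function over the n candidates."""
    n = len(angles)
    wrapped = [a % TWO_PI for a in angles]
    base = sum(wrapped) / n
    candidates = [(base + TWO_PI * k / n) % TWO_PI for k in range(n)]
    return min(candidates, key=lambda x: frechet_function(x, wrapped))


# 350 and 10 degrees average to 0 degrees, not to 180 degrees.
mean = frechet_mean_s1([math.radians(350.0), math.radians(10.0)])
print(round(circ_dist(mean, 0.0), 9))
```

The example shows why the naive arithmetic mean of angles is misleading: for 350° and 10° it returns 180°, which is the worst candidate, whereas the Fréchet mean is 0°.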
In this work 15, we introduce Multiple Interface String Alignment (MISA), a
visualization tool to display coherently various sequence and
structure based statistics at protein-protein interfaces (SSE
elements, buried surface area,
Illustrations are provided on the receptor binding domain (RBD) of coronaviruses, in complex with their cognate partner or (neutralizing) antibodies. MISAs computed with a minimal number of structures complement and enrich findings previously reported.
The corresponding package is available from the Structural Bioinformatics Library (SBL
and MISA).
In December 2019, the Chinese Center for Disease Control reported several cases of severe pneumonia resisting usual treatments in the city of Wuhan. This was the beginning of the COVID-19 pandemic, which caused more than 80 million infections and 1.7 million deaths during the year 2020 alone 1. This major outbreak has given rise to global public health responses as well as an international research effort of unprecedented scope and speed. This scientific mobilization has led to remarkable results, which have enabled a great deal of knowledge to be accumulated in just a few months on this novel pathogen: identification of the virus and of its main proteins, and analysis of its origin and its functioning. This basic biological knowledge is a prerequisite for medical advances: designing tests, finding a vaccine or a cure.
In this document 21, one year after the beginning of the worldwide spread of the disease, we wish to shed particular light on the contribution of bioinformatics to all this work. Bioinformatics is a discipline at the crossroads of computer science, mathematics and biology that has taken on an inestimable importance in modern biology and medicine. It provides the scientific community with computational models, algorithms and software that are both operational and effective. The discovery and study of the SARS-CoV-2 coronavirus is an emblematic example. The use of bioinformatics methods has been at the heart of essential milestones: from the sequencing of the virus genome and its annotation to the history of its origin, the modeling of interacting biological entities both at the molecular scale and at the network scale, and the study of the host genetic susceptibility. All these studies, as a whole, have made it possible to elucidate the nature and the functioning of the novel pathogen and have greatly contributed to the fight against COVID-19.
Prioritizing genes for their role in drug sensitivity is an important
step in understanding drug mechanisms of action and discovering new
molecular targets for co-treatment.
In this work 24, we formalize this problem by
considering two sets of genes Genetrank, a method to prioritize the genes in
Genetrank uses asymmetric random walks with
restarts, absorbing states, and a suitable renormalization scheme.
Using novel so-called saturation indices, we show that the conjunction
of absorbing states and renormalization yields an exploration of the
PPIN which is much more progressive than that afforded by random walks
with restarts only.
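As a point of comparison, the baseline mentioned above, a plain random walk with restarts, can be sketched in a few lines. This is not Genetrank: it has no asymmetry, no absorbing states and no renormalization. The graph encoding, the restart probability of 0.3 and the toy path graph are assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-12, max_iter=10_000):
    """Stationary scores of a random walk with restarts on a graph.

    adjacency maps each node to the list of its neighbors; seeds is the
    set of nodes the walker jumps back to with probability `restart`.
    """
    nodes = sorted(adjacency)
    start = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(start)
    for _ in range(max_iter):
        nxt = {v: restart * start[v] for v in nodes}
        for u in nodes:
            neighbors = adjacency[u]
            if not neighbors:
                # Dangling node: send its mass back to the seeds.
                for v in nodes:
                    nxt[v] += (1.0 - restart) * p[u] * start[v]
                continue
            share = (1.0 - restart) * p[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        diff = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Toy path graph a - b - c - d, with the walker restarting at node a.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(path, {"a"})
print(sorted(scores, key=scores.get, reverse=True))
```

On the path graph, scores decay with the distance to the seed, which is the sense in which the walk ranks nodes by proximity to a seed set.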
Using MINT as the underlying network, we apply Genetrank to a
predictive gene signature of cancer cell sensitivity to
tumor-necrosis-factor-related apoptosis-inducing ligand (TRAIL),
performed in single cells. Our ranking provides biological insights
on drug sensitivity and a gene set considerably enriched in genes
regulating TRAIL pharmacodynamics when compared to the most
significant differentially expressed genes obtained from a statistical
analysis framework alone. We also introduce gene expression
radars, a visualization tool to assess all pairwise interactions at
a glance.
Genetrank is made available in
the Structural Bioinformatics Library
(Genetrank). It
should prove useful for mining gene sets in conjunction with a
signaling pathway, whenever other approaches yield relatively large
sets of genes.
Tripeptide loop closure (TLC) is a standard procedure to reconstruct protein backbone conformations, by solving a polynomial system in a single variable yielding up to 16 real solutions.
In this work 17, we first show that multiprecision is required in a TLC
solver to guarantee the existence and the accuracy of solutions.
We then compare solutions yielded by the TLC solver against
tripeptides from the Protein Data Bank. We show that these solutions
are geometrically diverse (up to
We anticipate that these insights, coupled with our robust implementation in the Structural Bioinformatics Library (TLC), will help in understanding the properties of TLC reconstructions, with potential applications to the generation of conformations of flexible loops in particular.
The center of mass of a point set lying on a manifold generalizes the celebrated Euclidean centroid, and is ubiquitous in statistical analysis in non Euclidean spaces.
In this work 18, we give a complete characterization of the weighted
Our derivations are of interest in two respects. First, efficient
Computing the volume of a high dimensional polytope is a fundamental problem in geometry, also connected to the calculation of densities of states in statistical physics, and a central building block of such algorithms is the method used to sample a target probability distribution.
This paper 16 studies Hamiltonian Monte Carlo (HMC) with reflections on
the boundary of a domain, providing an enhanced alternative to
Hit-and-run (HAR) to sample a target distribution restricted to the polytope.
We make three contributions.
First, we provide a convergence bound, paving the way to more precise
mixing time analysis.
Second, we present a robust implementation based on multi-precision
arithmetic, a mandatory ingredient to guarantee exact predicates
and robust constructions. We however allow controlled failures
to happen, introducing the Sweeten Exact Geometric Computing (SEGC) paradigm.
Third, we use our HMC random walk to perform H-polytope volume
calculations, using it as an alternative to HAR within the volume
algorithm by Cousins and Vempala.
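The idea of reflecting trajectories on the boundary can be sketched in the simplest setting, a uniform target on an H-polytope {x : Ax ≤ b}. In that case the Hamiltonian flow reduces to straight-line motion and the energy is constant, so no accept/reject step is needed. The code below is a floating-point illustration under these assumptions, not the multi-precision implementation described above, and its function names and tolerances are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reflective_hmc_step(x, A, b, travel_time, rng, max_reflections=1000):
    """One step for the uniform distribution on {x : A x <= b}.

    Draw a Gaussian velocity, then move in a straight line for
    `travel_time`, reflecting the velocity whenever a facet is hit.
    """
    v = [rng.gauss(0.0, 1.0) for _ in x]
    t_left = travel_time
    for _ in range(max_reflections):
        t_hit, hit = t_left, None
        for a_i, b_i in zip(A, b):
            speed = dot(a_i, v)
            if speed > 1e-12:  # moving towards this facet
                t = (b_i - dot(a_i, x)) / speed
                if t < t_hit:
                    t_hit, hit = t, a_i
        t_hit = max(t_hit, 0.0)  # guard against rounding at the boundary
        x = [xi + t_hit * vi for xi, vi in zip(x, v)]
        if hit is None:
            return x  # travel time exhausted without hitting a facet
        t_left -= t_hit
        scale = 2.0 * dot(hit, v) / dot(hit, hit)
        v = [vi - scale * hi for vi, hi in zip(v, hit)]  # mirror reflection
    return x


def sample_polytope(A, b, x0, n_samples, travel_time=2.0, seed=0):
    rng = random.Random(seed)
    x, samples = list(x0), []
    for _ in range(n_samples):
        x = reflective_hmc_step(x, A, b, travel_time, rng)
        samples.append(x)
    return samples


# Unit square: 0 <= x <= 1 and 0 <= y <= 1.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 0.0, 1.0, 0.0]
points = sample_polytope(A, b, [0.5, 0.5], 2000)
```

In floating point, a point sitting on a facet may end up marginally outside the polytope. Avoiding that kind of failure is what motivates exact predicates and the multi-precision arithmetic discussed above.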
The systematic tests conducted up to dimension
We analyze a generalization of the minimum connectivity inference problem (MCI) that models the computation of low-resolution structures of macro-molecular assemblies, based on data obtained by native mass spectrometry. The generalization studied in this work allows us to consider more refined constraints for the characterization of low resolution models of large assemblies, such as degree constraints (e.g. a protein has a limited number of other proteins in contact).
More precisely, let -overlays a hyperedge
Given a graph Conflict Coloring consists in deciding whether exists a conflict coloring, that is a coloring in which Conflict Coloring is motivated by computational structural biology problems, high resolution determination of molecular assemblies.
The graph represents the subunits and the interaction between them, the colors are the given conformations, and the edges of the bipartite graphs are the incompatible conformations of two subunits.
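Under the reading given in the previous sentence (vertices are subunits, colors are candidate conformations, and each edge carries a set of incompatible pairs of conformations), a brute-force decision procedure can be sketched as a backtracking search. The data layout and names are assumptions made for illustration, and the search is exponential in the worst case.

```python
def conflict_coloring(colors, conflicts):
    """Search for a conflict coloring by backtracking.

    colors maps each vertex to its list of candidate colors.
    conflicts maps an edge (u, v) to the set of forbidden pairs
    (color_of_u, color_of_v). Returns a dict vertex -> color, or None
    when no conflict coloring exists.
    """
    vertices = list(colors)
    assignment = {}

    def compatible(v, c):
        for (a, b), forbidden in conflicts.items():
            if a == v and b in assignment and (c, assignment[b]) in forbidden:
                return False
            if b == v and a in assignment and (assignment[a], c) in forbidden:
                return False
        return True

    def backtrack(k):
        if k == len(vertices):
            return True
        v = vertices[k]
        for c in colors[v]:
            if compatible(v, c):
                assignment[v] = c
                if backtrack(k + 1):
                    return True
                del assignment[v]
        return False

    return dict(assignment) if backtrack(0) else None


# Three subunits with two conformations each; neighbors must differ.
same = {(0, 0), (1, 1)}
colors = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
print(conflict_coloring(colors, {("A", "B"): same, ("B", "C"): same}))
print(conflict_coloring(colors, {("A", "B"): same, ("B", "C"): same,
                                 ("A", "C"): same}))
```

The first instance (a path) admits a solution, while the second (a triangle with two conformations per subunit) does not. When every forbidden set consists of identical pairs, as here, the problem specializes to list coloring.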
In this work, we first establish the complexity dichotomies (polynomial vs Conflict Coloring and its variants. We provide some experiments in which we build instances of Conflict Coloring associated to Voronoi diagram in the plane, and we then analyse the existences of a solution related to parameters used in our experimental setup.
Frédéric Cazals participated in the following program committees:
PhD thesis:
Interns:
Frédéric Cazals participated in the following committees:
Dorian Mazauric participated in the following committees:
Dorian Mazauric:
Frédéric Cazals:
Dorian Mazauric:
Dorian Mazauric - Fête de la Science 2021:
Dorian Mazauric - Interventions at Maison de l'Intelligence Artificielle:
Dorian Mazauric - Cordées de la réussite (coordonné par Université Côte d'Azur):
Dorian Mazauric - Programme Chiche:
Dorian Mazauric - Formations:
Dorian Mazauric - In schools:
Dorian Mazauric - Internships: