## Section: Research Program

### Integrative Multi-Component Assembly and Modeling

#### Context

At the molecular level, each PPI is embodied by a physical 3D protein-protein interface. Therefore, if the 3D structures of a pair of interacting proteins are known, it should in principle be possible for a docking algorithm to use this knowledge to predict the structure of the complex. However, modeling protein flexibility accurately during docking is very computationally expensive. This is due to the very large number of internal degrees of freedom in each protein, associated with twisting motions around covalent bonds. Therefore, it is highly impractical to use detailed force-field or geometric representations in a brute-force docking search. Instead, most protein docking algorithms use fast heuristic methods to perform an initial rigid-body search in order to locate a relatively small number of candidate binding orientations, and these are then refined using a more expensive interaction potential or force-field model, which might also include flexible refinement using molecular dynamics (MD), for example.

#### Polar Fourier Docking Correlations

In our *Hex* protein docking program [60],
the shape of a protein molecule is represented using polar Fourier series
expansions of the form

$$\sigma \left(\underline{x}\right)=\sum _{nlm}{a}_{nlm}{R}_{nl}\left(r\right){y}_{lm}(\theta ,\phi ), \tag{1}$$

where $\sigma \left(\underline{x}\right)$ is a 3D shape-density function,
${a}_{nlm}$ are the expansion coefficients,
${R}_{nl}\left(r\right)$ are orthonormal Gauss-Laguerre polynomials and
${y}_{lm}(\theta ,\phi )$ are the real spherical harmonics.
The electrostatic potential, $\phi \left(\underline{x}\right)$,
and charge density, $\rho \left(\underline{x}\right)$,
of a protein may be represented using similar expansions.
Such representations
allow the *in vacuo* electrostatic interaction energy
between two proteins,
A and B, to be calculated as [51]

$$E=\frac{1}{2}\int {\phi}_{A}\left(\underline{x}\right){\rho}_{B}\left(\underline{x}\right)\mathrm{d}\underline{x}+\frac{1}{2}\int {\phi}_{B}\left(\underline{x}\right){\rho}_{A}\left(\underline{x}\right)\mathrm{d}\underline{x}. \tag{2}$$

This equation illustrates how the notion of *overlap* between
3D scalar quantities can be used to give a physics-based scoring function.
If the aim is to find the configuration that gives the most favourable interaction
energy, then it is necessary to perform a six-dimensional search in the space of available
rotational and translational degrees of freedom.
By re-writing the polar Fourier expansions using complex spherical harmonics,
we showed previously
that fast Fourier transform (FFT) techniques may be used to accelerate the search in up to
five of the six degrees of freedom [61].
Furthermore, we also showed that such calculations may be accelerated dramatically on
modern graphics processor units
[10],
[7].
Consequently, we are continuing to explore new ways to exploit the polar Fourier approach.
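
The benefit of the FFT can be illustrated in a single rotational degree of freedom. The following is a minimal sketch, not the *Hex* implementation: it assumes two hypothetical real-valued profiles sampled at equally spaced angles on a ring, and the function names are introduced here for illustration only. It scores every cyclic shift of one profile against the other, first by brute force and then using the correlation theorem.

```python
import numpy as np

def circular_correlation_direct(f, g):
    """Score every cyclic shift s of g against f by brute force: O(N^2)."""
    n = len(f)
    return np.array([np.dot(f, np.roll(g, s)) for s in range(n)])

def circular_correlation_fft(f, g):
    """Same scores via the correlation theorem: O(N log N)."""
    F = np.fft.rfft(f)
    G = np.fft.rfft(g)
    # The inverse transform of F * conj(G) gives sum_k f[k] * g[k - s] for each s.
    return np.fft.irfft(F * np.conj(G), n=len(f))

# Hypothetical test profile: g is f rotated by 5 samples.
rng = np.random.default_rng(0)
f = rng.standard_normal(64)
g = np.roll(f, -5)

direct = circular_correlation_direct(f, g)
fast = circular_correlation_fft(f, g)
best_shift = int(np.argmax(fast))  # shift of g that best superposes it on f
```

Both routines return one score per shift, so the best orientation is found with a single `argmax`. In a docking search the same principle is applied to several degrees of freedom at once, which is where the saving over brute-force enumeration becomes significant.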

#### Assembling Symmetrical Protein Complexes

Although protein-protein docking algorithms are improving
[62], [53],
it remains challenging to produce a
high resolution 3D model of a protein complex using *ab initio* techniques.
This is mainly due to the problem of structural flexibility described above.
However, with the aid of even just one simple constraint on the docking search
space, the quality of docking predictions can improve
considerably [10], [61].
In particular, many protein complexes involve symmetric arrangements of
one or more sub-units, and the presence of symmetry may be exploited to
reduce the search space considerably
[38], [59], [66].
For example,
using our operator notation
(in which $\widehat{R}$ and $\widehat{T}$ represent 3D rotation and translation operators,
respectively),
we have developed an algorithm which can generate and score candidate
docking orientations for monomers
that assemble into cyclic (${C}_{n}$) multimers using 3D integrals of the form

$${E}_{AB}(y,\alpha ,\beta ,\gamma )=\int \left[\widehat{T}(0,y,0)\widehat{R}(\alpha ,\beta ,\gamma ){\phi}_{A}\left(\underline{x}\right)\right]\times \left[\widehat{R}(0,0,{\omega}_{n})\widehat{T}(0,y,0)\widehat{R}(\alpha ,\beta ,\gamma ){\rho}_{B}\left(\underline{x}\right)\right]\mathrm{d}\underline{x}, \tag{3}$$

where the identical monomers A and B are initially placed at the origin, and ${\omega}_{n}=2\pi /n$ is the rotation about the principal $n$-fold symmetry axis. This example shows that complexes with cyclic symmetry have just 4 rigid body degrees of freedom (DOFs), compared to $6(n-1)$ DOFs for non-symmetrical $n$-mers. We have generalised these ideas in order to model protein complexes that crystallise into any of the naturally occurring point group symmetries (${C}_{n}$, ${D}_{n}$, $T$, $O$, $I$). This approach was published in 2016 [8], and was subsequently applied to several symmetrical complexes from the “CAPRI” blind docking experiment [45]. Although we currently use shape-based FFT correlations, the symmetry operator technique may equally be used to build and refine candidate solutions using a more accurate coarse-grained (CG) force-field scoring function.
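
The operator construction in Eq. (3) can be sketched geometrically. The code below is an illustrative sketch only, not the published algorithm: it assumes a monomer given as an array of point coordinates centred at the origin, a ZYZ Euler convention for $\widehat{R}(\alpha ,\beta ,\gamma )$, and the $z$ axis as the principal symmetry axis. All function names are hypothetical. It orients the monomer, translates it along $y$, and then generates the remaining subunits by repeated rotation through ${\omega}_{n}$.

```python
import numpy as np

def rotation_z(angle):
    """Rotation matrix about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation_y(angle):
    """Rotation matrix about the y axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def euler_zyz(alpha, beta, gamma):
    """ZYZ Euler rotation (an assumed convention for R(alpha, beta, gamma))."""
    return rotation_z(alpha) @ rotation_y(beta) @ rotation_z(gamma)

def build_cn_ring(monomer, n, y, alpha, beta, gamma):
    """Return an (n, n_points, 3) array holding the n subunits of a C_n ring.

    Only four parameters (y, alpha, beta, gamma) are free; every other
    subunit follows from the symmetry rotation omega_n = 2*pi/n.
    """
    oriented = monomer @ euler_zyz(alpha, beta, gamma).T   # apply R(alpha, beta, gamma)
    placed = oriented + np.array([0.0, y, 0.0])            # apply T(0, y, 0)
    omega = 2.0 * np.pi / n
    return np.stack([placed @ rotation_z(k * omega).T for k in range(n)])

def rigid_body_dofs(n, cyclic):
    """Rigid-body DOFs: 4 for a C_n assembly, 6(n-1) for a general n-mer."""
    return 4 if cyclic else 6 * (n - 1)
```

For a pentamer, for instance, `rigid_body_dofs(5, cyclic=False)` gives 24, whereas the symmetric search still has only 4 free parameters.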

#### Coarse-Grained Models

Many approaches have been proposed in the literature to take into account protein flexibility during docking. The most thorough methods rely on expensive atomistic simulations using MD. However, much of an MD trajectory is unlikely to be relevant to a docking encounter unless it is constrained to explore a putative protein-protein interface. Consequently, MD is normally used only to refine a small number of candidate rigid-body docking poses. A much faster, but more approximate, method is to use "coarse-grained" (CG) normal mode analysis (NMA) techniques to reduce the number of flexible degrees of freedom to just one or a handful of the most significant vibrational modes [57], [44], [54], [55]. In our experience, docking ensembles of NMA conformations does not give much improvement over basic FFT-based soft docking [68], and it is very computationally expensive to use side-chain repacking to refine candidate soft docking poses [4].

In the last few years, CG force-field models have become increasingly popular in the MD community because they allow very large biomolecular systems to be simulated using conventional MD programs [37]. Typically, a CG force-field representation replaces the atoms in each amino acid with two to four “pseudo-atoms”, and it assigns each pseudo-atom a small number of parameters to represent its physico-chemical properties. Because the cost of evaluating pair-wise energy functions grows quadratically with the number of particles, coarse-graining can speed up MD simulations by up to three orders of magnitude. Despite this simplification, CG representations can still produce useful models of very large multi-component assemblies [65]. Furthermore, this kind of CG model effectively integrates out many of the internal DOFs to leave a smoother but still physically realistic energy surface [50]. We are currently developing a CG scoring function for fast protein-protein docking and multi-component assembly. This work is part of the PhD project of Maria-Elisa Ruiz-Echartea [19], [64]. Beyond this PhD project, the CG scoring function will be exploited in all our docking projects, especially for RNA-protein docking (see below).
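
The quadratic argument can be made concrete with a short calculation. This is a back-of-the-envelope sketch: the protein size, the number of atoms per residue, and the number of pseudo-atoms per residue are illustrative assumptions, not parameters of any particular force field, and the function names are hypothetical.

```python
def pair_count(n_particles):
    """Number of distinct particle pairs in an all-pairs energy function."""
    return n_particles * (n_particles - 1) // 2

def pair_reduction(n_residues, atoms_per_residue, beads_per_residue):
    """Ratio of all-atom pair terms to coarse-grained pair terms."""
    all_atom = pair_count(n_residues * atoms_per_residue)
    coarse = pair_count(n_residues * beads_per_residue)
    return all_atom / coarse

# Assumed values: 300 residues, 16 atoms per residue, 3 pseudo-atoms per residue.
ratio = pair_reduction(n_residues=300, atoms_per_residue=16, beads_per_residue=3)
print(f"pair terms reduced by a factor of about {ratio:.1f}")
```

Under these assumptions the ratio is close to $(16/3)^2 \approx 28$. The sketch counts pair terms only; it does not model other contributions to the overall speed-up, such as the smoother energy surface mentioned above.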

#### Assembling Multi-Component Complexes and Integrative Structure Modeling

We also want to develop related approaches for integrative structure modeling using cryo-electron microscopy (cryo-EM). Thanks to recent developments in cryo-EM instruments and technologies, it is now feasible to capture low resolution images of very large macromolecular machines. However, while such developments offer the intriguing prospect of capturing biological systems at unprecedented levels of detail, they also bring an increasing need to analyse, annotate, and interpret the enormous volumes of data that will soon flow from the latest instruments. In particular, an emerging challenge is how to fit previously solved high resolution protein structures into low resolution cryo-EM density maps. The difficulty is that large molecular machines have multiple sub-components, some of which will be unknown, and many of which will fit each part of the map almost equally well. Thus, the general problem of building high resolution 3D models from cryo-EM data is like building a complex 3D jigsaw puzzle in which several pieces may be unknown or missing, and in which no piece fits perfectly. We wish to proceed firstly by putting more emphasis on the single-body terms in the scoring function [42], and secondly by using fast CG representations and knowledge-based distance restraints to prune large regions of the search space. This work made some progress during the PhD project of Maria Elisa Ruiz Echartea, but it still requires further effort.

#### Protein-Nucleic Acids Interactions

As well as playing an essential role in the translation of genetic information into proteins, RNA molecules carry out many other essential biological functions in cells, often through their interactions with proteins. A critical challenge in modelling such interactions computationally is that the RNA is often highly flexible, especially in single-stranded (ssRNA) regions of its structure. These flexible regions are often very important because it is through their flexibility that the RNA can adjust its 3D conformation in order to bind to a protein surface. However, conventional protein-protein docking algorithms generally assume that the 3D structures to be docked are rigid, and so are not suitable for modeling protein-RNA interactions. There is therefore much interest in developing protein-RNA docking algorithms which can take RNA flexibility into account. This research topic was initiated with the recruitment of Isaure Chauvot de Beauchêne in 2016 and is becoming a major activity in the team. A novel flexible docking algorithm is currently under development in the team. It first docks small fragments of ssRNA (typically three nucleotides at a time) onto a protein surface, and then combinatorially reassembles those fragments in order to recover a contiguous ssRNA structure on the protein surface [41], [40].
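
The reassembly step can be sketched as a combinatorial search over fragment poses. This is a simplified illustration, not the algorithm of [41], [40]: it assumes that each docked trinucleotide pose is reduced to one 3D point per nucleotide, that consecutive fragments overlap by two nucleotides, and that two poses are compatible when the RMSD over the shared nucleotides is below a cutoff. The function names, the cutoff value, and the toy data are all hypothetical.

```python
import numpy as np

def overlap_rmsd(pose_a, pose_b):
    """RMSD between the last two nucleotides of pose_a and the first two of pose_b."""
    diff = pose_a[1:] - pose_b[:2]
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def assemble_chains(pose_sets, cutoff=1.0):
    """Enumerate tuples of pose indices, one per fragment position, in which
    every pair of consecutive poses is compatible (depth-first search)."""
    chains = []

    def extend(chain):
        pos = len(chain)
        if pos == len(pose_sets):
            chains.append(tuple(chain))
            return
        for idx, pose in enumerate(pose_sets[pos]):
            previous = pose_sets[pos - 1][chain[-1]] if pos > 0 else None
            if previous is None or overlap_rmsd(previous, pose) <= cutoff:
                extend(chain + [idx])

    extend([])
    return chains

# Toy data: a straight 5-nucleotide chain cut into 3 overlapping fragments,
# with one displaced decoy pose added at each fragment position.
true_chain = np.array([[6.0 * i, 0.0, 0.0] for i in range(5)])
true_poses = [true_chain[i:i + 3] for i in range(3)]
shifts = [np.array([0.0, 20.0, 0.0]),
          np.array([0.0, 0.0, 20.0]),
          np.array([0.0, -20.0, 0.0])]
pose_sets = [[true_poses[0] + shifts[0], true_poses[0]],
             [true_poses[1], true_poses[1] + shifts[1]],
             [true_poses[2] + shifts[2], true_poses[2]]]
chains = assemble_chains(pose_sets, cutoff=1.0)
```

In this toy case only the chain built from the three undisplaced poses survives the compatibility filter. With realistic numbers of poses per fragment, exhaustive enumeration of this kind grows quickly, so pruning incompatible partial chains early matters.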

As the accuracy of the initial fragment docking sets an upper limit on the accuracy of the full model, we are now focusing on improving that step. A key component of our docking tool is the energy function for protein-fragment interactions, which is used both to drive the sampling (positioning of the fragments) by minimization and to discriminate the correct final positions from decoys (i.e. false positives). We are developing a new knowledge-based energy function that will be learnt by machine-learning methods from public structural data on ssRNA-protein complexes.

In the future, we will also improve the combinatorial algorithm that reassembles the docked fragments, by incorporating experimental constraints and machine-learning approaches.