## Section: Research Program

### Integrative Multi-Component Assembly and Modeling

#### Context

At the molecular level, each PPI is embodied by a physical 3D protein-protein interface. Therefore, if the 3D structures of a pair of interacting proteins are known, it should in principle be possible for a docking algorithm to use this knowledge to predict the structure of the complex. However, modeling protein flexibility accurately during docking is computationally very expensive because each protein has a very large number of internal degrees of freedom associated with twisting motions around covalent bonds. It is therefore impractical to use detailed force-field or geometric representations in a brute-force docking search. Instead, most protein docking algorithms use fast heuristic methods to perform an initial rigid-body search that locates a relatively small number of candidate binding orientations. These candidates are then refined using a more expensive interaction potential or force-field model, which may also include flexible refinement using molecular dynamics (MD), for example.

#### Polar Fourier Docking Correlations

In our *Hex* protein docking program [77],
the shape of a protein molecule is represented using polar Fourier series
expansions of the form

$$\sigma(\underline{x}) = \sum_{nlm} a_{nlm} R_{nl}(r)\, y_{lm}(\theta,\phi), \tag{1}$$

where $\sigma \left(\underline{x}\right)$ is a 3D shape-density function,
${a}_{nlm}$ are the expansion coefficients,
${R}_{nl}\left(r\right)$ are orthonormal Gauss-Laguerre polynomials and
${y}_{lm}(\theta ,\phi )$ are the real spherical harmonics.
The electrostatic potential, $\phi \left(\underline{x}\right)$,
and charge density, $\rho \left(\underline{x}\right)$,
of a protein may be represented using similar expansions.
Such representations
allow the *in vacuo* electrostatic interaction energy
between two proteins,
A and B, to be calculated as [60]

$$E = \frac{1}{2}\int \phi_{A}(\underline{x})\,\rho_{B}(\underline{x})\,\mathrm{d}\underline{x} + \frac{1}{2}\int \phi_{B}(\underline{x})\,\rho_{A}(\underline{x})\,\mathrm{d}\underline{x}. \tag{2}$$

This equation illustrates how the notion of *overlap* between
3D scalar quantities can be used to give a physics-based scoring function.
If the aim is to find the configuration that gives the most favourable interaction
energy, then it is necessary to perform a six-dimensional search in the space of available
rotational and translational degrees of freedom.
By re-writing the polar Fourier expansions using complex spherical harmonics,
we showed previously
that fast Fourier transform (FFT) techniques may be used to accelerate the search in up to
five of the six degrees of freedom [78].
Furthermore, we also showed that such calculations may be accelerated dramatically on
modern graphics processing units (GPUs) [10], [6].
Consequently, we are continuing to explore new ways to exploit the polar Fourier approach.
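
To illustrate why FFT techniques speed up such a search, the following minimal sketch scores every translation of one grid against another with a single pair of forward transforms. It is an assumption-laden illustration: it uses the simpler Cartesian-grid translational correlation rather than the polar Fourier scheme used in *Hex*, and the function names and toy grids are hypothetical.

```python
import numpy as np


def fft_translation_scan(receptor, ligand):
    """Score all periodic translations of `ligand` against `receptor` at once.

    Returns scores with scores[t] = sum_x receptor[x] * ligand[x - t],
    i.e. the overlap after shifting the ligand grid by t voxels (with wrap-around).
    """
    f_receptor = np.fft.fftn(receptor)
    f_ligand = np.fft.fftn(ligand)
    # Correlation theorem: a product in Fourier space replaces the sum over x.
    return np.real(np.fft.ifftn(f_receptor * np.conj(f_ligand)))


def brute_force_score(receptor, ligand, shift):
    """Reference overlap for a single shift, used only to check the FFT result."""
    return float(np.sum(receptor * np.roll(ligand, shift, axis=(0, 1, 2))))


# Toy grids (illustrative only): one occupied voxel in each molecule.
receptor = np.zeros((8, 8, 8))
receptor[5, 2, 3] = 1.0
ligand = np.zeros((8, 8, 8))
ligand[1, 1, 1] = 1.0

scores = fft_translation_scan(receptor, ligand)
best_shift = tuple(int(i) for i in np.unravel_index(np.argmax(scores), scores.shape))
```

For a grid of $N$ voxels, the FFT route costs on the order of $N \log N$ operations for all $N$ translations, whereas evaluating each translation separately costs on the order of $N^2$. In this Cartesian variant the rotational degrees of freedom still have to be scanned explicitly.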

#### Assembling Symmetrical Protein Complexes

Although protein-protein docking algorithms are improving
[79], [62],
it remains challenging to produce a
high-resolution 3D model of a protein complex using *ab initio* techniques,
mainly due to the problem of structural flexibility described above.
However, with the aid of even just one simple constraint on the docking search
space, the quality of docking predictions can improve
considerably [10], [78].
In particular, many protein complexes involve symmetric arrangements of
one or more sub-units, and the presence of symmetry may be exploited to
reduce the search space considerably
[40], [75], [84].
For example,
using our operator notation
(in which $\widehat{R}$ and $\widehat{T}$ represent 3D rotation and translation operators,
respectively),
we have developed an algorithm which can generate and score candidate
docking orientations for monomers
that assemble into cyclic (${C}_{n}$) multimers using 3D integrals of the form

$$E_{AB}(y,\alpha,\beta,\gamma) = \int \left[\widehat{T}(0,y,0)\,\widehat{R}(\alpha,\beta,\gamma)\,\phi_{A}(\underline{x})\right] \times \left[\widehat{R}(0,0,\omega_{n})\,\widehat{T}(0,y,0)\,\widehat{R}(\alpha,\beta,\gamma)\,\rho_{B}(\underline{x})\right]\mathrm{d}\underline{x}, \tag{3}$$

where the identical monomers A and B are initially placed at the origin, and ${\omega}_{n}=2\pi /n$ is the rotation about the principal $n$-fold symmetry axis. This example shows that complexes with cyclic symmetry have just four rigid-body degrees of freedom (DOFs), compared to $6(n-1)$ DOFs for non-symmetrical $n$-mers. We have generalised these ideas in order to model protein complexes that crystallise into any of the naturally occurring point-group symmetries (${C}_{n}$, ${D}_{n}$, $T$, $O$, $I$). This approach was published in 2016 [8], and was subsequently applied to several symmetrical complexes from the “CAPRI” blind docking experiment [53]. Although we currently use shape-based FFT correlations, the symmetry operator technique may equally be used to build and refine candidate solutions using a more accurate coarse-grained (CG) force-field scoring function.
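
The way four parameters fix a whole cyclic assembly can be sketched in a few lines. The example below is only an illustration under stated assumptions: it takes the rotation operator to be a ZYZ Euler rotation, the symmetry axis to be the $z$ axis, and the monomer to be a plain array of coordinates centred at the origin. The helper names (`euler_zyz`, `build_cn`, `rigid_body_dofs`) are hypothetical and are not taken from our software.

```python
import numpy as np


def rot_z(angle):
    """Rotation matrix about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rot_y(angle):
    """Rotation matrix about the y axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def euler_zyz(alpha, beta, gamma):
    """Assumed convention: R(alpha, beta, gamma) = Rz(alpha) Ry(beta) Rz(gamma)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_z(gamma)


def build_cn(monomer, n, y, alpha, beta, gamma):
    """Place n copies of `monomer` (shape (N, 3)) around the z axis.

    The monomer is rotated by R(alpha, beta, gamma), translated by (0, y, 0),
    and then copied by successive rotations of omega_n = 2*pi/n about z.
    Returns an array of shape (n, N, 3).
    """
    placed = monomer @ euler_zyz(alpha, beta, gamma).T + np.array([0.0, y, 0.0])
    omega_n = 2.0 * np.pi / n
    return np.stack([placed @ rot_z(k * omega_n).T for k in range(n)])


def rigid_body_dofs(n, cyclic):
    """DOF count quoted in the text: 4 for C_n, 6(n-1) for a general n-mer."""
    return 4 if cyclic else 6 * (n - 1)


# Toy monomer (illustrative only): four points centred at the origin.
monomer = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0], [0.0, -2.0, 0.0]])
trimer = build_cn(monomer, n=3, y=10.0, alpha=0.1, beta=0.2, gamma=0.3)
```

For a trimer, `rigid_body_dofs(3, cyclic=False)` gives 12 parameters, whereas the symmetric construction above needs only `y`, `alpha`, `beta` and `gamma`.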

#### Coarse-Grained Models

Many approaches have been proposed in the literature to take into account protein flexibility during docking. The most thorough methods rely on expensive atomistic simulations using MD. However, much of an MD trajectory is unlikely to be relevant to a docking encounter unless it is constrained to explore a putative protein-protein interface. Consequently, MD is normally used only to refine a small number of candidate rigid-body docking poses. A much faster but more approximate method is to use CG normal mode analysis (NMA) techniques to reduce the number of flexible degrees of freedom to just one or a handful of the most significant vibrational modes [68], [52], [65], [66]. In our experience, docking ensembles of NMA conformations does not give much improvement over basic FFT-based soft docking [87], and it is very computationally expensive to use side-chain repacking to refine candidate soft docking poses [3].
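
As an illustration of how CG NMA reduces flexibility to a few collective modes, the sketch below builds an anisotropic network model (ANM), one common elastic-network form of NMA. This is an assumption made for the example, not necessarily the method of the cited works; the cutoff, the spring constant `gamma`, and the random toy coordinates are all hypothetical.

```python
import numpy as np


def anm_modes(coords, cutoff=15.0, gamma=1.0):
    """Return the non-rigid normal modes of an anisotropic network model.

    coords: array of shape (N, 3), e.g. one pseudo-atom per residue.
    Pairs closer than `cutoff` are joined by identical springs of strength `gamma`.
    Returns (eigenvalues, eigenvectors) with the six lowest (rigid-body)
    modes removed, sorted from the softest mode upwards.
    """
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            dist2 = float(d @ d)
            if dist2 > cutoff ** 2:
                continue
            # Off-diagonal 3x3 block for the spring between i and j.
            block = -gamma * np.outer(d, d) / dist2
            hessian[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hessian[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            # Diagonal blocks are minus the sum of the off-diagonal blocks.
            hessian[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hessian[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    evals, evecs = np.linalg.eigh(hessian)
    return evals[6:], evecs[:, 6:]


# Toy structure (illustrative only): ten random pseudo-atoms, fully connected.
rng = np.random.default_rng(0)
coords = rng.normal(size=(10, 3)) * 5.0
evals, evecs = anm_modes(coords, cutoff=100.0)

# Displace the structure along the softest internal mode.
softest = evecs[:, 0].reshape(-1, 3)
displaced = coords + 0.5 * softest
```

A flexible docking search would then sample only the amplitudes of the first one or few columns of `evecs` instead of all $3N$ Cartesian coordinates.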

In the last few years, CG *force-field* models have become
increasingly popular in the MD community because they allow very large
biomolecular systems to be simulated using conventional MD programs
[39].
Typically, a CG force-field representation replaces the atoms in each
amino acid with two to four “pseudo-atoms”, and assigns each pseudo-atom
a small number of parameters to represent its physico-chemical properties.
By directly attacking the quadratic nature of pair-wise energy functions,
coarse-graining can speed up MD simulations by up to three orders of magnitude.
Nonetheless, CG representations can still produce useful models of very
large multi-component assemblies [83].
Furthermore, this kind of coarse-graining effectively integrates out many
of the internal DOFs to leave a smoother but still physically realistic
energy surface [59].
We are therefore developing a CG scoring function for fast
protein-protein docking and multi-component assembly
as part of the PhD project of Maria-Elisa Ruiz-Echartea [31], [82].
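
The effect of coarse-graining on a pair-wise energy function can be made concrete with a small sketch. The code below replaces groups of atoms by their centroids and counts the resulting pair interactions. The centroid mapping, the function names, and the figures used (300 residues, an average of 8 heavy atoms per residue, 3 pseudo-atoms per residue) are illustrative assumptions, not the parameters of any particular force field.

```python
import numpy as np


def coarse_grain(coords, group_ids):
    """Replace each group of atoms by its centroid.

    coords: array of shape (N, 3); group_ids: length-N integer labels
    0..G-1 (every label is assumed to occur at least once).
    Returns an array of shape (G, 3) of pseudo-atom positions.
    """
    group_ids = np.asarray(group_ids)
    n_groups = int(group_ids.max()) + 1
    sums = np.zeros((n_groups, 3))
    np.add.at(sums, group_ids, coords)  # accumulate coordinates per group
    counts = np.bincount(group_ids, minlength=n_groups)
    return sums / counts[:, None]


def n_pairs(n_particles):
    """Number of distinct particle pairs in a pair-wise energy sum."""
    return n_particles * (n_particles - 1) // 2


# Toy mapping (illustrative only): four atoms merged into two pseudo-atoms.
atoms = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                  [0.0, 4.0, 0.0], [0.0, 6.0, 0.0]])
pseudo_atoms = coarse_grain(atoms, [0, 0, 1, 1])

# Assumed sizes for a hypothetical 300-residue protein.
all_atom_pairs = n_pairs(300 * 8)  # 8 heavy atoms per residue (assumption)
cg_pairs = n_pairs(300 * 3)        # 3 pseudo-atoms per residue (assumption)
```

In this toy case the number of pair terms falls roughly seven-fold, which shows only the quadratic part of the saving.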

#### Assembling Multi-Component Complexes and Integrative Structure Modeling

We also want to develop related approaches for integrative structure modeling using cryo-electron microscopy (cryo-EM). Thanks to recent developments in cryo-EM instruments and technologies, it is now feasible to capture low-resolution images of very large macromolecular machines. While such developments offer the intriguing prospect of being able to capture biological systems at unprecedented levels of detail, there will also be an increasing need to analyse, annotate, and interpret the enormous volumes of data that will soon flow from the latest instruments. In particular, an emerging challenge is how to fit previously solved high-resolution protein structures into low-resolution cryo-EM density maps. The difficulty is that large molecular machines have multiple sub-components, some of which will be unknown, and many of which will fit each part of the map almost equally well. Thus, the general problem of building high-resolution 3D models from cryo-EM data is like assembling a complex 3D jigsaw puzzle in which several pieces may be unknown or missing, and none will fit perfectly. We wish to proceed firstly by putting more emphasis on the single-body terms in the scoring function [49], and secondly by using fast CG representations and knowledge-based distance restraints to prune large regions of the search space (thesis project of Maria Elisa Ruiz Echartea).
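
The two ingredients named above can be sketched as follows. This is a hedged illustration, not our scoring function: it assumes the single-body term is a normalised cross-correlation between a component's simulated density and the map, and that a distance restraint is an upper bound on the separation of two component centres. All names and numbers are hypothetical.

```python
import numpy as np


def density_correlation(model_density, map_density):
    """Normalised cross-correlation between two density grids of equal shape.

    Used here as an assumed single-body term: how well one placed
    component matches the experimental map (1.0 means a perfect match).
    """
    a = model_density - model_density.mean()
    b = map_density - map_density.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))


def satisfies_restraints(centres, restraints):
    """Check upper-bound distance restraints between component centres.

    centres: dict mapping a component name to its 3D centre.
    restraints: iterable of (name_i, name_j, max_distance) tuples.
    """
    return all(
        np.linalg.norm(np.asarray(centres[i]) - np.asarray(centres[j])) <= d_max
        for i, j, d_max in restraints
    )


# Toy data (illustrative only).
rng = np.random.default_rng(0)
cryo_map = rng.random((6, 6, 6))
self_fit = density_correlation(cryo_map, cryo_map)

centres = {"A": (0.0, 0.0, 0.0), "B": (3.0, 4.0, 0.0)}  # 5 units apart
keep = satisfies_restraints(centres, [("A", "B", 6.0)])
prune = not satisfies_restraints(centres, [("A", "B", 4.0)])
```

Because the restraint check needs only component centres, it can discard a candidate arrangement before any density score is computed.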