CAPSID

2023 Activity report, Project-Team CAPSID

RNSR: 201521171B

Research center Inria Centre at Université de Lorraine
In partnership with: Université de Lorraine, CNRS
Team name: Computational Algorithms for Protein Structures and Interactions
In collaboration with: Laboratoire lorrain de recherche en informatique et ses applications (LORIA)
Domain: Digital Health, Biology and Earth
Theme: Computational Biology

Keywords

Computer Science and Digital Science

A3.1.1. Modeling, representation
A3.1.9. Database
A3.1.10. Heterogeneous data
A3.1.11. Structured data
A3.2.1. Knowledge bases
A3.2.2. Knowledge extraction, cleaning
A3.2.4. Semantic Web
A3.2.5. Ontologies
A3.2.6. Linked data
A3.3.2. Data mining
A3.5.1. Analysis of large graphs
A6.1.4. Multiscale modeling
A6.2.7. High performance computing
A6.3.3. Data processing
A6.5.5. Chemistry
A8.2. Optimization
A9.1. Knowledge
A9.2. Machine learning

1 Team members, visitors, external collaborators

Research Scientists

Marie-Dominique Devignes [Team leader, CNRS, Researcher, HDR]
Isaure Chauvot de Beauchêne [CNRS, Researcher]
Yasaman Karami [INRIA, Researcher]
Hamed Khakzad [INRIA, Advanced Research Position]
Bernard Maigret [CNRS, Emeritus]

Faculty Members

Sabeur Aridhi [UL, Associate Professor, from Jul 2023]
Sabeur Aridhi [UL, Associate Professor Delegation, until Jun 2023]
Malika Smaïl-Tabbone [UL, Associate Professor, HDR]

Post-Doctoral Fellow

Dominique Mias-Lucquin [UL, Post-Doctoral Fellow, until Jun 2023]

PhD Students

Diego Amaya Ramirez [UL, ATER, until Sep 2023]
Kamrul Islam [UL, until Jan 2023]
Mohammed Khatbane [UL, from Oct 2023]
Omid Mokhtari [INRIA, from Oct 2023]
Victor Pryakhin [UL, from Oct 2023]

Technical Staff

Hrishikesh Dhondge [CNRS, Engineer, until Aug 2023]
Anna Kravchenko [CNRS, Engineer]
Athénaïs Vaginay [CNRS, Engineer, until Mar 2023]
Taher Yacoub [CNRS, Engineer, from Nov 2023]

Interns and Apprentices

Benjamin Gottis [CNRS, Intern, from Feb 2023 until Jul 2023]
Jean-Baptiste Paquin [UL, Intern, from Mar 2023 until Apr 2023]
Hugo Rimet [UNIV NANTES, Intern, from Mar 2023 until May 2023]

Administrative Assistants

Antoinette Courrier [CNRS]
Sophie Drouot [INRIA]

External Collaborators

Taha Boukhobza [UL, HDR]
Emmanuel Bresso [CHRU Nancy]
Pablo Chacon [Université de Madrid]

2 Overall objectives

2.1 Computational Challenges in Structural Biology

NB: This section has been remodeled since the death of Dave Ritchie in 2019.

Many of the processes within living organisms can be studied and understood in terms of biochemical interactions between large macromolecules such as DNA, RNA, and proteins. To a first approximation, DNA may be considered to encode the blueprint for life, whereas proteins and RNA make up the three-dimensional (3D) molecular machinery. Many biological processes are governed by complex systems of proteins and/or RNA which interact cooperatively to regulate the chemical composition within a cell or to carry out a wide range of biochemical processes such as photosynthesis, metabolism, and cell signalling. It is becoming increasingly feasible to isolate and characterise some of the individual molecular components of such systems, but it remains extremely difficult to achieve detailed models of how these complex systems actually work. Consequently, a new multidisciplinary approach called integrative structural biology has emerged, which aims to bring together experimental data from a wide range of sources and resolution scales in order to meet this challenge 58, 72.

Understanding how biological systems work at the level of 3D molecular structures presents fascinating challenges for biologists and computer scientists alike. Despite being made from a small set of simple chemical building blocks, protein and nucleic acid (NA) molecules have a remarkable ability to self-assemble into complex molecular machines which carry out very specific biological processes. As such, these molecular machines may be considered as complex systems because their properties are much greater than the sum of the properties of their component parts.

2.2 Two Research Axes

The overall objective of the CAPSID team is to develop algorithms and software to help study biological systems and phenomena from a structural point of view. In particular, the team aims to develop algorithms which can help to model the structures of large multi-component biomolecular machines and to develop tools and techniques to represent and mine knowledge of the 3D shapes of proteins, NA and their interactions. Thus, a unifying theme of the team is to tackle the recurring problem of representing and reasoning about large 3D macromolecular shapes. More specifically, our aim is to develop computational techniques to represent, analyse, and compare the shapes and interactions of biomolecules in order to better understand how their 3D structures relate to their biological function. In summary, the CAPSID team is organised according to two research axes whose complementarity constitutes an original contribution to the field of structural bioinformatics:

Axis 1: Knowledge Discovery in Structural Databases,
Axis 2: Integrative Multi-Component Assembly and Modeling.

In the first axis, our main objective is to design, implement and test new KDD ("Knowledge Discovery in Databases") approaches to exploit specifically the structural information contained and sometimes hidden in many biological databases. These approaches will be oriented towards understanding molecular interactions in living organisms under physiological or pathological conditions.

In the second axis, our main objective is to propose new and fast methods to model the 3D structure of multi-component systems and characterize their dynamic behaviour. The challenge here is to integrate molecular flexibility into 3D models, thanks to molecular dynamics simulation and/or combinatorial approaches.

Finally, the complementarity of the two axes will be expressed through a common objective oriented towards the proposal of possible new treatments against diseases, based on the knowledge extracted and on the advances in 3D modeling of flexible molecular interactions. This objective will benefit from our network of biologist and clinician partners.

3 Research program

This section presents the current CAPSID research program. Several subjects that were part of the program when the team was created (2015) or at the last evaluation (2017) are no longer presented due to the death of Dave Ritchie.

3.1 Knowledge Discovery in Structural Databases

3.1.1 Context

In this axis, the CAPSID team develops methods related to knowledge discovery from databases (KDD, 36). The diversity of biological databases and resources is such today that it is increasingly difficult to consider each database independently from the others 62. A limited subset of these resources is devoted to the 3D structure of biological objects (proteins, nucleic acids, glycans, etc.). Structural information is also contained in databases classifying protein domains as building blocks of proteins that can be reused in different proteins sharing the same function (Pfam, CATH and InterPro are well-known examples of such databases) 56, 67, 26. There are millions of proteins across all living species but only tens of thousands of domains that are combined in proteins. Thus, complex tasks such as predicting protein function or interactions can be simplified when envisaged at the domain level.

Due to the great diversity of databases, Knowledge Graphs (KGs) are increasingly used to represent and integrate biological information. There is no single definition of KGs as these graphs cover a large variety of domains and data representation contexts (for instance the GAFAM companies advertise various KG uses). The main feature that differentiates KGs from classical graphs is the fact that both nodes (or entities) and edges (or relations) in the graph are heterogeneous and belong to various types described in the KG schema (metagraph). The field of biological knowledge discovery in KGs is expanding rapidly 52. Most biological KGs today are developed for drug repurposing tasks (e.g. HetioNet 41 or DRKG). Clinicians are also very interested in network science carried out on rich knowledge graphs as a means to interpret biomarker studies. However, there is still a need for curated, reliable biological KGs and for efficient knowledge discovery methods in KGs.
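
As an illustration of what "typed nodes and edges described in a schema" can mean in practice, here is a minimal Python sketch. The node names, type names, relation names and the `violations` helper are all hypothetical and chosen for the example; they do not come from Hetionet, DRKG or any CAPSID resource.

```python
from collections import namedtuple

# A KG edge is a typed triple: head entity, relation type, tail entity.
Triple = namedtuple("Triple", "head relation tail")

# Hypothetical schema (metagraph): which relation may link which node types.
SCHEMA = {
    ("Drug", "binds", "Protein"),
    ("Protein", "interacts_with", "Protein"),
    ("Protein", "associated_with", "Disease"),
}

# Hypothetical entities and their types.
NODE_TYPES = {
    "drug_1": "Drug",
    "prot_A": "Protein",
    "prot_B": "Protein",
    "disease_X": "Disease",
}

def violations(triples, node_types, schema):
    """Return the triples whose (head type, relation, tail type) is not in the schema."""
    return [
        t for t in triples
        if (node_types.get(t.head), t.relation, node_types.get(t.tail)) not in schema
    ]

triples = [
    Triple("drug_1", "binds", "prot_A"),
    Triple("prot_A", "interacts_with", "prot_B"),
    Triple("prot_B", "associated_with", "disease_X"),
    Triple("disease_X", "binds", "prot_A"),  # deliberately inconsistent with the schema
]

bad = violations(triples, NODE_TYPES, SCHEMA)
```

Checking edges against the schema is one simple way to detect curation errors before any knowledge discovery step is run on the graph.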

3.1.2 Knowledge discovery from protein structural databases

Concerning protein structural databases, we aim to explore novel classification paradigms exploiting existing resources about protein folds and domains 21, 22, 56, 67. In particular it will be interesting to use Kpax, our structural alignment tool 63, to define domain-domain similarity matrices. A non-trivial issue with clustering is how to handle similarity rather than mathematical distance matrices, and how to find the optimal number of clusters, especially when using a non-Euclidean similarity measure. We will adapt the algorithms and the calculation of quality indices to the Kpax similarity measure. More fundamentally, it will be necessary to integrate this classification step in the more general KDD process leading from data to knowledge.
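
To make the clustering issue concrete, the following Python sketch clusters items from a similarity matrix alone and selects the number of clusters with a silhouette index computed from dissimilarities. It is only an illustration under stated assumptions: the similarity values are invented (they are not Kpax scores), similarities are assumed to lie in [0, 1] and are turned into dissimilarities with d = 1 - s, and naive average-linkage clustering stands in for whatever algorithm the team will actually adapt.

```python
from itertools import combinations

def to_dissimilarity(sim):
    """Convert a symmetric similarity matrix with values in [0, 1] into d = 1 - s."""
    n = len(sim)
    return [[0.0 if i == j else 1.0 - sim[i][j] for j in range(n)] for i in range(n)]

def linkage(dis, a, b):
    """Average dissimilarity between the members of clusters a and b."""
    return sum(dis[i][j] for i in a for j in b) / (len(a) * len(b))

def average_linkage(dis, k):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(dis))]
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(dis, *pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters

def mean_silhouette(dis, clusters):
    """Silhouette index from dissimilarities only; requires at least two clusters."""
    total = 0.0
    for c in clusters:
        if len(c) == 1:
            continue  # usual convention: a singleton contributes 0
        for i in c:
            a = sum(dis[i][j] for j in c if j != i) / (len(c) - 1)
            b = min(sum(dis[i][j] for j in o) / len(o) for o in clusters if o is not c)
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(dis)

def choose_k(sim, candidates):
    """Return the candidate k with the highest mean silhouette, and its clustering."""
    dis = to_dissimilarity(sim)
    best_k = max(candidates, key=lambda k: mean_silhouette(dis, average_linkage(dis, k)))
    return best_k, average_linkage(dis, best_k)

# Invented similarities for five domains forming two obvious groups.
sim = [
    [1.0, 0.9, 0.9, 0.2, 0.2],
    [0.9, 1.0, 0.9, 0.2, 0.2],
    [0.9, 0.9, 1.0, 0.2, 0.2],
    [0.2, 0.2, 0.2, 1.0, 0.9],
    [0.2, 0.2, 0.2, 0.9, 1.0],
]
best_k, clusters = choose_k(sim, range(2, len(sim)))
```

Both the linkage and the quality index only ever read pairwise dissimilarities, which is why this kind of pipeline does not require the measure to be Euclidean.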

For example, protein domain classification is relevant for studying domain-domain interactions (DDI). Our previous work on Knowledge-Based Docking (KBDOCK, 38, 40) will be updated and extended using newly published DDIs. Methods for inferring new DDIs from existing protein-protein interactions (PPIs) will be developed. Efforts should be made to validate such inferred DDIs so that they can be used to enrich DDI classification and predict new PPIs.

3.1.3 Function Annotation in Large Protein Graphs

Knowledge of the functional properties of proteins can shed considerable light on how they might interact. However, huge numbers of protein sequences in public databases such as UniProt/TrEMBL lack any functional annotation, and the functional annotation of such sequences is a highly challenging problem. We are developing graph-based and machine learning techniques to annotate automatically the available unannotated sequences with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology (GO) terms (note that these terms are organised hierarchically, allowing generalization/specialization reasoning). The idea is to transfer annotations from expert-reviewed sequences present in the UniProt/SwissProt database (about 560 thousand entries) to unreviewed sequences present in the UniProt/TrEMBL database (about 80% of 180 million entries). For this, we have to learn from the UniProt/SwissProt database how to compute the similarity of proteins sharing identical or similar functional annotations. Various similarity measures can be tested using cross-validation approaches in the UniProt/SwissProt database. For instance, we can use primary sequence or domain signature similarities. More complex similarities can be computed with graph-embedding techniques.
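
The annotation transfer idea can be sketched in a few lines of Python. This is a toy illustration, not the team's method: the entries, domain names and the 0.5 threshold are invented, the similarity is a plain Jaccard index on domain signatures, and `generalize_ec` only illustrates the hierarchical reasoning on EC numbers mentioned above.

```python
def jaccard(a, b):
    """Jaccard similarity between two domain signatures (sets of domain names)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def transfer_annotations(query_domains, reviewed, threshold=0.5):
    """Copy the labels of the most similar reviewed entry if it is similar enough.

    reviewed maps an entry id to (domain signature, set of labels).
    Returns (labels, best similarity); labels is empty when nothing passes the threshold.
    """
    best_id, best_sim = None, 0.0
    for entry_id, (domains, _labels) in reviewed.items():
        s = jaccard(query_domains, domains)
        if s > best_sim:
            best_id, best_sim = entry_id, s
    if best_id is None or best_sim < threshold:
        return set(), best_sim
    return set(reviewed[best_id][1]), best_sim

def generalize_ec(ec_a, ec_b):
    """Most specific EC class shared by two EC numbers, e.g. 1.1.1.1 and 1.1.1.2 give 1.1.1.-"""
    shared = []
    for x, y in zip(ec_a.split("."), ec_b.split(".")):
        if x != y:
            break
        shared.append(x)
    return ".".join(shared + ["-"] * (4 - len(shared)))

# Hypothetical reviewed entries: id -> (domain signature, EC labels).
reviewed = {
    "R1": ({"domA", "domB"}, {"1.1.1.1"}),
    "R2": ({"domC"}, {"2.7.11.1"}),
}

labels, sim_r1 = transfer_annotations({"domA", "domB", "domD"}, reviewed)
no_labels, sim_none = transfer_annotations({"domX"}, reviewed)
```

In a cross-validation setting, the same function would be applied to held-out reviewed entries so that the transferred labels can be compared with the curated ones.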

3.1.4 Knowledge discovery algorithms in large biological knowledge graphs

KGs are particularly useful and appropriate in biology, to represent and integrate the complex contents of biological databases 60. We intend to design algorithms for leveraging information embedded in biological KGs (also known as complex networks). In biology, KGs mostly represent PPIs, integrated with various properties attached to proteins, such as pathways, drug binding or relation with diseases. Setting up similarity measures for proteins in a knowledge graph is a difficult challenge. Our objective is to extract useful knowledge from such graphs in order to better understand and highlight the role of multi-component assemblies in various types of cells or organisms. Ultimately, knowledge graphs can be used to model and simulate the functioning of such molecular machinery in the context of the living cell, under physiological or pathological conditions.

3.2 Integrative Multi-Component Assembly and Modelling

3.2.1 Context

This axis deals with 3D protein structure and interactions. In fact, the long-standing problem of predicting a 3D structure from a protein sequence was solved in 2021 by the AlphaFold2 (DeepMind) 46 and RosettaFold 24 methods. This success, revealed in the CASP14 (Critical Assessment of Structure Prediction) challenge, was possible not only thanks to AI methods but also because the number of experimental 3D structures in the Protein Data Bank (PDB) has reached a sufficient size. For the same type of reasons, the rigid docking problem (in which the bodies to dock are rigid) seems to be on the way to being solved as well 28, 35. However, research is still required to address the problem of docking disordered proteins or flexible nucleic acids that will fold as they bind to proteins. This is the direction taken by the team since the arrival of Isaure Chauvot de Beauchêne, the inventor of a fragment-based approach for RNA docking onto proteins.

Modeling protein - and even more RNA - flexibility accurately during docking is very computationally expensive. This is due to the very large number of internal degrees of freedom in each molecule, associated with twisting motions around covalent bonds. Therefore, it is highly impractical to use detailed force-field or geometric representations in a brute-force docking search. Instead, most docking algorithms use fast heuristic methods to perform an initial rigid-body search in order to locate a relatively small number of candidate binding orientations, and these are then refined using a more expensive interaction potential or force-field model, which might also include flexible refinement using molecular dynamics (MD), for example.

3.2.2 Coarse-Grained Models

Many approaches have been proposed in the literature to take into account protein (and more recently RNA/DNA) flexibility during docking. The most thorough methods rely on expensive atomistic simulations using MD. However, much of an MD trajectory is unlikely to be relevant to a docking encounter unless it is constrained to explore a putative protein-protein/NA interface. Consequently, MD is normally only used to refine a small number of candidate rigid body docking poses. A much faster - but more approximate - method is to use "coarse-grained" (CG) normal mode analysis (NMA) techniques to reduce the number of flexible degrees of freedom to just one or a handful of the most significant vibrational modes 34, 54, 57, 59. In our experience, docking ensembles of NMA conformations do not give much improvement over basic FFT-based soft docking 71, and it is very computationally expensive to use side-chain repacking to refine candidate soft docking poses 39.

In the last few years, CG force-field models have become increasingly popular in the MD community because they allow very large biomolecular systems to be simulated using conventional MD programs 23. Typically, a CG force-field representation replaces the 5-15 atoms in each amino acid with 2-4 “pseudo-atoms” (each pseudo-atom represents a few atoms of an amino acid as a single bead). It then assigns each pseudo-atom a small number of parameters to represent its chemo-physical properties. By directly attacking the quadratic nature of pair-wise energy functions, coarse-graining can speed up MD simulations by up to three orders of magnitude. Nonetheless, such CG models can still produce useful models of very large multi-component assemblies 66. Furthermore, this type of CG model effectively integrates many internal degrees of freedom (DOFs) to build a smoother but still physically realistic energy surface 42. We are currently developing a CG scoring function for RNA-protein docking by fragment assembly.
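
A small Python sketch may help to see where the quadratic saving comes from. It is a deliberately simplified assumption: beads are built as centroids of consecutive groups of atoms, whereas real CG force fields group atoms chemically, and the figures (1000 residues, 10 atoms or 3 beads per residue) are illustrative values taken from the ranges quoted above.

```python
def coarse_grain(atoms, atoms_per_bead):
    """Replace each consecutive group of atoms by its centroid (one bead per group)."""
    beads = []
    for start in range(0, len(atoms), atoms_per_bead):
        group = atoms[start:start + atoms_per_bead]
        beads.append(tuple(sum(axis) / len(group) for axis in zip(*group)))
    return beads

def n_pairs(n):
    """Number of distinct pairs among n particles, the cost driver of pairwise energies."""
    return n * (n - 1) // 2

# Four toy atoms grouped two by two.
atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 4.0, 0.0), (12.0, 6.0, 0.0)]
beads = coarse_grain(atoms, atoms_per_bead=2)  # [(1.0, 0.0, 0.0), (11.0, 5.0, 0.0)]

# Illustrative protein: 1000 residues, 10 atoms each versus 3 beads each.
all_atom_pairs = n_pairs(1000 * 10)  # 49,995,000 pairs
cg_pairs = n_pairs(1000 * 3)         # 4,498,500 pairs
pair_ratio = all_atom_pairs / cg_pairs
```

In this toy setting the pair count alone shrinks by roughly one order of magnitude, so the larger speed-ups quoted above must also rely on other effects that the sketch does not model.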

3.2.3 Assembling Multi-Component Complexes and Integrative Structure Modelling

We also want to develop related approaches for integrative structure modeling using cryo-electron microscopy (cryo-EM). Thanks to recent developments in cryo-EM instruments and technologies, it is now feasible to capture low resolution images of very large macromolecular machines. However, while such developments offer the intriguing prospect of being able to trap biological systems in unprecedented levels of detail, they will also come with an increasing need to analyse, annotate, and interpret the enormous volumes of data that will soon flow from the latest instruments. In particular, a new challenge that is emerging is how to fit previously solved high resolution protein structures into low resolution cryo-EM density maps. However, the problem here is that large molecular machines will have multiple sub-components, some of which will be unknown, and many of which will fit each part of the map almost equally well. Thus, the general problem of building high resolution 3D models from cryo-EM data is like building a complex 3D jigsaw puzzle in which several pieces may be unknown or missing, and none of which will fit perfectly. We wish to proceed firstly by putting more emphasis on the single-body terms in the scoring function 33, and secondly by using fast CG representations and knowledge-based distance restraints to prune large regions of the search space, as initiated with the EROS-DOCK software 6, 65.

3.2.4 Protein-Nucleic Acid Interactions

As well as playing an essential role in the translation of DNA into proteins, RNA molecules carry out many other essential biological functions in cells, often through their interactions with proteins. A critical challenge in modeling such interactions computationally is that the RNA is often highly flexible, especially in single-stranded (ssRNA) regions of its structure. These flexible regions are often very important because it is through their flexibility that the RNA can adjust its 3D conformation in order to bind to a protein surface. However, conventional protein-protein docking algorithms generally assume that the 3D structures to be docked are rigid, and so are not suitable for modeling protein-RNA interactions. There is therefore much interest in developing dedicated protein-RNA docking algorithms which can take RNA flexibility into account. This research topic was initiated with the recruitment of Isaure Chauvot de Beauchêne in 2016 and is becoming a major activity in the team. A novel flexible docking algorithm is currently under development in the team. It first docks small fragments of ssRNA (typically three nucleotides at a time) onto a protein surface, and then combinatorially reassembles those fragments in order to recover a contiguous ssRNA structure on the protein surface 29, 30.
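
The reassembly step can be pictured with the following Python sketch. It is not the team's algorithm: poses are reduced to a score and two end points, two consecutive poses are assumed to be connectable when the end of one lies within a hypothetical `max_gap` of the start of the next, and a simple dynamic programme is used to pick the best-scoring connectable chain.

```python
import math

def assemble(pose_lists, max_gap):
    """Best-scoring chain with one pose per fragment and connectable neighbours.

    pose_lists[i] holds the docked poses of fragment i as (score, start_xyz, end_xyz);
    lower scores are better. Returns (total score, pose index per fragment),
    or None when no connectable chain exists.
    """
    # best[j]: best (total score, chain of pose indices) ending with pose j of the current fragment
    best = {j: (pose[0], [j]) for j, pose in enumerate(pose_lists[0])}
    for i in range(1, len(pose_lists)):
        new_best = {}
        for j, (score, start, _end) in enumerate(pose_lists[i]):
            candidates = [
                (prev_total + score, chain + [j])
                for k, (prev_total, chain) in best.items()
                if math.dist(pose_lists[i - 1][k][2], start) <= max_gap
            ]
            if candidates:
                new_best[j] = min(candidates)
        best = new_best
    return min(best.values()) if best else None

# Two fragments with two invented poses each (coordinates in arbitrary units).
fragment_0 = [(-5.0, (0.0, 0.0, 0.0), (3.0, 0.0, 0.0)),
              (-8.0, (10.0, 0.0, 0.0), (13.0, 0.0, 0.0))]
fragment_1 = [(-4.0, (3.5, 0.0, 0.0), (6.5, 0.0, 0.0)),
              (-9.0, (30.0, 0.0, 0.0), (33.0, 0.0, 0.0))]

result = assemble([fragment_0, fragment_1], max_gap=1.0)
```

In this toy case the individually best poses of each fragment cannot be connected, so the selected chain uses the two lower-ranked poses; this is the reason why fragment docking has to keep many candidate poses per fragment.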

As the correctness of the initial docking of the fragments sets an upper limit to the correctness of the full model, we are now focusing on improving that step. A key component of our docking tool is the energy function of the protein-fragment interactions that is used both to drive the sampling (positioning of the fragments) by minimization, and to discriminate the correct final positions from decoys (i.e., false positives). We are developing a new approach to create knowledge-based parameters for coarse-grained energy functions from public structural data, in collaboration with Sjoerd de Vries (INSERM). Such an approach will be applied first to ssRNA-protein complexes, then to other types of complexes such as protein-peptide complexes.

Another key requirement for this approach is an exhaustive but non-redundant library of possible internal conformations of RNA fragments. Our library is built by clustering hundreds of thousands of experimentally known RNA structures, based on an approximate geometric similarity criterion. We want to develop new algorithms for the clustering of 3D conformations based on internal coordinates and on epsilon-net theory, in order to optimise the representativity and computational cost of the library.
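
As a rough illustration of the epsilon-net idea, the Python sketch below greedily selects representatives so that every conformation lies within `eps` of a representative while representatives stay more than `eps` apart. It is only a sketch: the conformations are invented 2D points standing in for internal coordinates, and plain Euclidean distance is an assumption (it ignores, for example, the periodicity of torsion angles).

```python
import math

def greedy_epsilon_net(points, eps, dist=math.dist):
    """Select representatives such that every point is within eps of one of them
    and any two representatives are more than eps apart."""
    reps = []
    for p in points:
        if all(dist(p, r) > eps for r in reps):
            reps.append(p)
    return reps

# Invented conformations described by two internal coordinates each.
conformations = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.1), (10.0, 10.0)]
library = greedy_epsilon_net(conformations, eps=1.0)
```

The value of `eps` sets the trade-off named above: a smaller radius gives a more representative but larger (and more expensive) library.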

In the future, we will improve the combinatorial algorithm used for reassembling the docked fragments using both experimental constraints and knowledge-based constraints derived from the research carried out in Axis 1.

4 Application domains

4.1 Biomedical Knowledge Discovery

Participants: Marie-Dominique Devignes [contact person], Malika Smaïl-Tabbone [contact person], Sabeur Aridhi, Kamrul Islam, Athénaïs Vaginay.

Our main application for Axis 1: "New Approaches for Knowledge Discovery in Structural Databases", concerns biomedical knowledge discovery. We intend to develop KDD approaches on preclinical (experimental) or clinical datasets integrated with knowledge graphs, with a focus on discovering which PPIs or molecular machines play an essential role in the onset of a disease and/or for personalised medicine.

As a first step we have been involved since 2015 in the ANR RHU “FIGHT-HF” (Fight Heart Failure) project, which is coordinated by the CIC-P (Centre d'Investigation Clinique Plurithématique) at the CHRU Nancy and INSERM U1116. In this project, the molecular mechanisms that underlie heart failure (HF) are re-visited at the cellular and tissue levels in order to adapt treatments to patients' needs in a more personalised way. The CAPSID team is in charge of a workpackage dedicated to network science. A platform has been constructed with the help of a company called Edgeleap (Utrecht, NL) in which biological molecular data and ontologies, available from public sources, are represented in a single integrated complex network also known as knowledge graph. We are developing querying and analysis facilities to help biologists and clinicians interpret their cohort results in the light of existing interactions and knowledge. We are also currently analysing pre-clinical data produced at the INSERM unit on the comparison of the aging process in obese versus lean rats. Using our expertise in receptor-ligand docking, we are investigating possible cross-talks between mineralocorticoid and other nuclear receptors.

Another application is carried out in the context of an interdisciplinary project funded by the Université de Lorraine, in collaboration with the CRAN laboratory. It concerns the study of the role of estrogen receptors in the development of glioblastoma tumors. The available data is high-dimensional but involves rather small numbers of samples. The challenge is to identify relevant sets of genes which are differentially expressed in various phenotyped groups (w.r.t. gender, age, tumor grade). The objectives are to infer pathways involving these genes and to propose candidate models of tumor development which will be experimentally tested thanks to an ex-vivo experimental system available at the CRAN.

Finally, simulating biological networks will be important to understand biological systems and test new hypotheses. One major challenge is the identification of perturbations responsible for the transformation of a healthy system to a pathological one and the discovery of therapeutic targets to reverse this transformation. Control theory, which consists in finding interventions on a system in order to prevent it from entering undesirable states or to force it to converge towards a desired state, is of great interest for this challenge. It can be formulated as “How to force a broken (pathological) system to behave as it should (normal state)?”. Many formalisms are used to model biological processes, such as Differential Equations (DE), Boolean Networks (BN), and cellular automata. In her PhD thesis, Athénaïs Vaginay investigates ways to find a BN fitting both the knowledge about topology and the state transitions “inferred” from experimental data. This step is known as “Boolean function synthesis”. Our aim is to design automated methods for building biological networks and define operators to intervene on them 70. Our approaches will be driven by knowledge and will keep close connection with experimental data.
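
To give an idea of what Boolean function synthesis involves, here is a minimal Python sketch. It is an illustration under simplifying assumptions and not the method developed in the thesis: the three-node network and its transitions are invented, updates are assumed to be synchronous, and candidate functions are enumerated by brute force as truth tables over each node's known regulators.

```python
from itertools import product

def synthesize(regulators, transitions):
    """For each node, list every truth table over its regulators that agrees with
    the observed synchronous transitions (an empty list signals a contradiction)."""
    result = {}
    for node, regs in regulators.items():
        required = {}  # regulator values at time t -> node value at time t+1
        consistent = True
        for before, after in transitions:
            key = tuple(before[r] for r in regs)
            if required.setdefault(key, after[node]) != after[node]:
                consistent = False
        tables = []
        if consistent:
            inputs = list(product([0, 1], repeat=len(regs)))
            free = [i for i in inputs if i not in required]  # unobserved inputs
            for values in product([0, 1], repeat=len(free)):
                table = dict(required)
                table.update(zip(free, values))
                tables.append(table)
        result[node] = tables
    return result

# Hypothetical topology: A is regulated by B, B by A, and C by both.
regulators = {"A": ["B"], "B": ["A"], "C": ["A", "B"]}

# Two invented observed transitions (state at t, state at t+1).
transitions = [
    ({"A": 0, "B": 1, "C": 0}, {"A": 1, "B": 0, "C": 1}),
    ({"A": 1, "B": 0, "C": 1}, {"A": 0, "B": 1, "C": 1}),
]

candidates = synthesize(regulators, transitions)
```

Here A and B are fully determined by the data, while C keeps four candidate functions because two of its four input combinations were never observed; choosing among such candidates is where additional knowledge has to come in.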

4.2 Prokaryotic Type IV Secretion Systems

Participants: Isaure Chauvot de Beauchêne [contact person], Marie-Dominique Devignes, Bernard Maigret, Dominique Mias-Lucquin.

Concerning Axis 2: "Integrative Multi-Component Assembly and Modeling", our first application domain is related to prokaryotic Type IV secretion systems.

Prokaryotic type IV secretion systems constitute a fascinating example of a family of nanomachines capable of translocating DNA and protein molecules through the cell membrane from one cell to another 20. The complete system involves at least 12 proteins. The structure of the core channel involving three of these proteins has recently been determined by cryo-EM experiments for Gram-negative bacteria 37, 64. However, the detailed nature of the interactions between the other components and the core channel remains to be found. Therefore, these secretion systems represent a family of complex biological systems that call for integrated modeling approaches to fully understand their machinery.

In the framework of the Lorraine Université d'Excellence (LUE-FEDER) “CITRAM” project we are pursuing our collaboration with Nathalie Leblond of the Genome Dynamics and Microbial Adaptation (DynAMic) laboratory (UMR 1128, Université de Lorraine, INRAE) on the mechanism of horizontal transfer by integrative conjugative elements (ICEs) and integrative mobilisable elements (IMEs) in prokaryotic genomes. These elements use type IV secretion systems for transferring DNA horizontally from one cell to another. We have discovered more than 200 new ICEs/IMEs by systematic exploration of 72 Streptococcus genomes and characterised a new class of relaxases 68. We have modeled the dimer of this relaxase protein by homology with a known structure. For this, we have created a new pipeline to model symmetrical dimers of multi-domain proteins. As one activity of the relaxase is to cut the DNA for its transfer, we are also currently studying the DNA-protein interactions that are involved in this very first step of horizontal transfer (see next section).

4.3 Protein - RNA Interactions

Participants: Isaure Chauvot de Beauchêne [contact person], Antoine Moniot, Anna Kravchenko, Hrishikesh Dhondge, Marie-Dominique Devignes, Malika Smaïl-Tabbone.

The second application domain of Axis 2 concerns protein-nucleic acid interactions. We need to assess and optimise our new algorithms on concrete protein-nucleic acid complexes in close collaboration with external partners coming from the experimental field of structural biology. To facilitate such collaborations, we are creating automated and re-usable protein-nucleic acid docking pipelines.

This is the case for our PEPS collaboration “InterANRIL” with the IMoPA lab (CNRS-Université de Lorraine). We are currently working with biologists to apply our fragment-based docking approach 30 to model complexes of the long non-coding RNA (lncRNA) ANRIL with proteins and DNA 19.

In the framework of our LUE-FEDER CITRAM project (see above), we are adapting this approach and pipeline to single-strand DNA docking, in order to model the complex formed by a bacterial relaxase and its target DNA 68.

In the framework of our H2020 ITN project RNAct, we tackle a defined group of RNA-binding proteins containing RNA-Recognition Motifs (RRM) 53, 31. We study existing and predicted complexes between various types of RRMs and various RNA sequences in order to infer rules of their sequence-structure-interaction relationship, and to help design new synthetic proteins with targeted RNA specificity. This work is carried out in close collaboration with computer scientists and biophysicists of the consortium.

4.4 3D structural differences among HLA antigens

Participants: Marie-Dominique Devignes [contact person], Malika Smaïl-Tabbone, Diego Amaya Ramirez, Bernard Maigret.

This application domain has emerged in Axis 2 through the Inria-Inserm PhD thesis project of Diego Amaya Ramirez, in collaboration with the Immunology and Histocompatibility Laboratory at the APHP Saint-Louis Hospital in Paris. Differences between donor and recipient HLA proteins are one of the major limitations of organ transplantation because of the ubiquity of HLA on the cells of tissues and organs 27. Indeed, in case of incompatibility between the HLA proteins of the donor and those of the patient, an immune response is triggered in the patient that can result in rejection of the transplanted organ. The thesis project aims at deciphering the role played by tiny 3D structure differences between donor and recipient HLA proteins in determining the production of donor-specific antibodies by the recipient. We are currently developing methods to compare local structure variations between HLA proteins, taking into account the dynamics of these proteins.

4.5 Investigating the dynamical behaviour of protein complexes with MD simulation

Participants: Yasaman Karami [contact person], Malika Smaïl-Tabbone, Isaure Chauvot de Beauchêne, Victor Pryakhin.

Allosteric pathways in protein-nucleic acid complexes.

Novel methodological approaches based on deep learning (AlphaFold2 and RosettaFold) have started to make remarkable advances in protein structure prediction and design. However, our knowledge of the dynamical behaviour and function of proteins is still highly limited. One important example is the notion of allostery, which refers to processes whereby a binding event at one site of a biological macromolecule affects the binding activity at another distinct functional site, enabling the regulation of the corresponding function. The allosteric behavior of a macromolecular system arises from the properties of the native free-energy landscape of the system, and how this landscape is remodeled by various perturbations, such as ligand binding, protonation, mutations, post-translational modifications, or interactions with other molecules. Allosteric pathways of communication are therefore of high importance, yet they are not well understood. All-atom MD simulations can be used to capture subtle dynamical changes that are associated with allosteric signaling. Moreover, graph theory-based methods have been developed to investigate the set of trajectories generated by MD simulations and extract allosteric pathways. Yasaman Karami has previously developed a method called COMMA (COMmunication Mapping) that describes the dynamical architecture of a protein by predicting the network of communications within the system 48, and she continues to improve it.
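
The general principle of graph-based pathway extraction from MD data can be sketched as follows in Python. This is explicitly not COMMA: it is a generic correlation-network scheme in which residues in contact are linked by an edge weighted by -log|r|, where r is the Pearson correlation of a per-residue time series, and a candidate pathway is a shortest path. The residue names, time series, contact list and the 0.6 threshold are all invented for the example.

```python
import heapq
import math

def pearson(x, y):
    """Pearson correlation of two equally long series with non-zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def build_graph(series, contacts, min_corr):
    """Link residues in contact whose motions are correlated; strong correlation = short edge."""
    graph = {name: {} for name in series}
    for a, b in contacts:
        c = min(1.0, abs(pearson(series[a], series[b])))  # clamp rounding errors above 1
        if c >= min_corr:
            graph[a][b] = graph[b][a] = -math.log(c)
    return graph

def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the list of nodes on a shortest path, or None."""
    queue = [(0.0, source, [source])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph[node].items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None

# Invented per-residue time series (e.g. one coordinate sampled along a trajectory).
series = {
    "r1": [0.0, 1.0, 2.0, 3.0],
    "r2": [0.0, 1.0, 2.0, 4.0],
    "r3": [1.0, 2.0, 3.0, 5.0],
    "r4": [1.0, -1.0, 1.0, -1.0],
}
contacts = [("r1", "r2"), ("r2", "r3"), ("r1", "r4"), ("r4", "r3")]

graph = build_graph(series, contacts, min_corr=0.6)
path = shortest_path(graph, "r1", "r3")
```

In this toy example the weakly correlated residue r4 is disconnected by the threshold, so the only pathway from r1 to r3 runs through r2.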

In 2023, Benjamin Gottis was hired for an M2 internship (6 months, February-July) to improve COMMA for the analysis of large protein complexes. His internship resulted in improved threshold definitions in COMMA and faster calculations. Moreover, Yasaman Karami and Malika Smaïl-Tabbone recruited a PhD student, Victor Pryakhin, who obtained a doctoral contract from Université de Lorraine and started in October 2023. His PhD subject is to investigate communication networks in protein-RNA complexes using deep learning approaches. At the same time, Yasaman Karami has obtained computational resources on the Jean Zay supercomputer to perform MD simulations on a set of protein-RNA complexes. The results of these computations will be used in Victor Pryakhin's doctoral project.

Dynamics of Type IV Pilus.

Type IV pili (T4P) are dynamic filaments at the surface of many bacteria that can rapidly extend and retract and withstand strong forces. T4P are responsible for various cellular processes due to their highly dynamic behavior and are crucial for bacterial virulence in many human pathogens, including enterohemorrhagic Escherichia coli (EHEC), Pseudomonas aeruginosa (PaK), Neisseria meningitidis (Nm), Neisseria gonorrhoeae (Ng), and Myxococcus xanthus (Mx). They are assembled by complex protein machinery localized in the bacterial envelope, and are formed by the repeat of major pilin subunits organised in a helical manner. The structures of these T4Ps have been determined by combining NMR with a cryo-EM density map of the pilus, resulting in atomistic models of the T4P filaments. Despite the recent progress in cryo-EM and integrative structural biology, the high flexibility of this family of fibers often limits the resolution of the structure. Modeling is therefore a necessary step in structure determination. To unravel the structural basis of the dynamics of different T4Ps, we performed extensive and large-scale classical and steered MD simulations of five different T4Ps: EHEC, PaK, Nm, Ng and Mx. The results highlighted key regions for the function of those filaments. Such simulations require substantial computational resources, which were provided by the Swiss National Supercomputing Centre (CSCS) and the GENCI computing center. This project is one of the main activities of Yasaman Karami, in collaboration with Michael Nilges (Institut Pasteur, Paris) and Edward Egelman (University of Virginia, USA).

4.6 Learning from protein surfaces to accurately predict their functional sites.

Participants: Hamed Khakzad [contact person], Marie-Dominique Devignes, Omid Mokhtari.

Today, deep learning from large datasets of sequences and structures (AlphaFold, RoseTTAFold) enables structure prediction from sequence with remarkable accuracy. However, the performance of these methods in predicting protein-protein interactions (PPIs) is still debatable, as in the case of AlphaFold multimer 35, 51. More importantly, none of these methods can address the interactions between other types of macromolecules such as DNA, RNA, and ligands. Taking a different stance, this project focuses on a different aspect of macromolecular structure: the surface. Extracting and learning surface features, which are critical for function including the interaction with other macromolecules, has been less explored than other aspects of macromolecular structures. The idea here is that the surface geometrical and chemical patterns could be used to understand and predict important aspects of macromolecular functions. Although these features are difficult to determine visually, they can be learned by a deep neural network trained over large-scale datasets. This project will attempt to create a broader methodological framework than the existing approaches, one that can be used to capture aspects of surface conformational diversity and interactions with different types of macromolecules, including small molecules (ligands), protein-RNA, protein-DNA, and protein-protein interactions. This project is part of Hamed Khakzad's ANR-CPJ funding, and a PhD student (Omid Mokhtari) was recruited in October 2023 to work on this topic.

5 Social and environmental responsibility

5.1 Environmental Footprint of Research Activities

In structural bioinformatics and deep learning approaches, the computational costs are usually very high. The CAPSID team takes care to use shared equipment (Platform MBI-DS4H, Grid5K) to run HPC (High Performance Computing) jobs as efficiently as possible. In particular, we use the "best effort" mode for distributed jobs.

When travelling to conferences, members of the CAPSID team travel by train whenever possible.

6 New software, platforms, open data

6.1 New software

6.1.1 CROMAST

Name:
Cross-Mapper of domain Structural instances
Keywords:
Protein domain, 3D structure, Classification, Databases, Workflow
Scientific Description:
see the scientific paper 10.1093/bioadv/vbad081
Functional Description:
CroMaSt (Cross-Mapper of domain Structural instances) is an automated iterative workflow to clarify the assignment of protein domains to a given domain type of interest, based on their 3D structure and by cross-mapping of domain structural instances between domain databases. CroMaSt classifies all structural instances of a given domain type into 4 different categories (Core, True, Domain-like, and Failed).
Release Contributions:
Operational version for publication
URL:
https://workflowhub.eu/workflows/390
Publication:
hal-04210856
Contact:
Hrishikesh Dhondge

6.1.2 HIPPO

Name:
HIstogram-based Pseudo-POtential
Keywords:
Structural Biology, Computational biology
Functional Description:
Pipeline to create scoring potentials from docking decoys
Contact:
Isaure Chauvot de Beauchêne

6.1.3 RRMpip

Name:
RRM_modeling_pipeline
Keywords:
Structural Biology, Computational biology
Functional Description:
pipeline to create 3D models of RRM protein domains
Contact:
Isaure Chauvot de Beauchêne

6.1.4 ssRNATTRACT

Keywords:
Computational biology, Structural Biology
Functional Description:
3D modelling of protein-ssRNA complexes by fragment assembly
URL:
https://github.com/isaureCdB/ssRNATTRACT
Contact:
Isaure Chauvot de Beauchêne

6.2 New platforms

Participants: Marie-Dominique Devignes [scientific responsible], Malika Smaïl-Tabbone [contact person], Sabeur Aridhi, Bernard Maigret, Antoine Moniot, Diego Amaya Ramirez.

The CAPSID team initiated the creation of the LORIA MBI-DS4H research platform, which provides a shared environment to the CAPSID and ORPAILLEUR teams for running distributed intensive computation. This platform is also the place for optimizing codes that can be run later on Grid 5K or on the Jean Zay supercomputer. Moreover, the platform offers opportunities for newcomers in the team to be trained in good practices for development and for sharing code and data.

The technical support of the platform is ensured by the LORIA SISR (Service d'Ingénierie en Soutien de la Recherche) via a private project on gitlab.

6.3 Open data

Benchmark simulated datasets for clustering of mixed data

Description
We deposited in the Inria open data repository seven sets of synthetic datasets to be used for benchmarking clustering algorithms for mixed (continuous and categorical) data. The synthetic datasets correspond to 9 simulation designs described in the README file and refer to a 2021 publication in Scientific Reports 61
Contact
Marie-Dominique Devignes
url
doi:10.57745/6IFQYQ

Benchmark of proteins bound to ssDNA

Description
We explored the Protein Data Bank (PDB) to collect protein-ssDNA structures and create a multi-conformational docking benchmark including both bound and unbound protein structures. Because unbound ssDNA is highly flexible, no unbound ssDNA structure is included in the benchmark. This benchmark is, to our knowledge, the first one made to peruse available structures of ssDNA-protein interactions to such an extent, aiming to improve computational docking tools dedicated to this kind of molecular interaction. Related publication: 55.
Contact
Isaure Chauvot de Beauchêne
url
doi:10.57745/3W8CCV

MD simulations and ML dataset of HLA-EpiCheck epitope predictor tool

Description
This dataset contains all the data used to implement the B-cell epitope predictor tool called HLA-EpiCheck (preprint available on bioRxiv).
Contact
Diego Amaya-Ramirez and Marie-Dominique Devignes
url
doi:10.57745/GXZHH8

Experiences with a training DSW knowledge model for early-stage researchers

Description
Data management is fast becoming an essential part of scientific practice, driven by open science and FAIR (findable, accessible, interoperable, and reusable) data sharing requirements. Whilst data management plans (DMPs) are clear to data management experts and data stewards, understandings of their purpose and creation are often obscure to the producers of the data, which in academic environments are often PhD students. Within the RNAct EU Horizon 2020 ITN project, we engaged the 10 RNAct early-stage researchers (ESRs) in a training project aimed at formulating a DMP. To do so, we used the Data Stewardship Wizard (DSW) framework and modified the existing Life Sciences Knowledge Model into a simplified version aimed at training young scientists, with computational or experimental backgrounds, in core data management principles. We collected feedback from the ESRs during this exercise. Here, we introduce our new life-sciences training DMP template for young scientists. We report and discuss our experiences as principal investigators (PIs) and ESRs during this project and address the typical difficulties that are encountered in developing and understanding a DMP. We found that the DS-wizard can also be an appropriate tool for DMP training, to get terminology and concepts across to researchers 7.
Contact
Malika Smaïl-Tabbone and Marie-Dominique Devignes
url
doi:10.12688/openreseurope.15609

7 New results

7.1 Axis 1: Knowledge Discovery in Structural Databases

Participants: Marie-Dominique Devignes, Malika Smaïl-Tabbone, Sabeur Aridhi, Kamrul Islam, Athénaïs Vaginay.

7.1.1 Knowledge graph mining with embedding-based methods

In the context of Md Kamrul Islam's PhD project, we addressed the problem of link prediction in large knowledge graphs (KGs) using KG embedding methods. These methods aim to learn low-dimensional vector representations of entities and relations in a KG. Such representations (in a latent space) facilitate link prediction tasks along with other downstream tasks. In this context, it is important to achieve both an efficient KG embedding and explainable predictions. When learning efficient embeddings, sampling negative triples is an important step, as KGs only come with observed positive triples. We previously proposed an efficient simple negative sampling (SNS) method based on the assumption that the entities which are closer to the corrupted entity in the embedding space are able to provide high-quality negative triples 43. As for explainability, we also report in the same paper a new rule mining method which exploits the learned embeddings 43.
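The assumption behind this kind of negative sampling can be illustrated with a minimal sketch. It is not the SNS implementation from the cited paper: the function name, the use of Euclidean distance, the tail-only corruption, and the toy entities and coordinates are all assumptions made for illustration.

```python
import math

def nearest_negatives(triple, embeddings, observed, n_neg=2):
    """Corrupt the tail of `triple` with the entities closest to it in embedding space.

    Hypothetical helper: `embeddings` maps entity -> coordinate tuple,
    `observed` is the set of known positive (head, relation, tail) triples.
    """
    head, rel, tail = triple
    # Rank every other entity by Euclidean distance to the tail being replaced.
    candidates = sorted(
        (e for e in embeddings if e != tail),
        key=lambda e: math.dist(embeddings[e], embeddings[tail]),
    )
    negatives = []
    for entity in candidates:
        # A corruption that is itself an observed triple is not a negative.
        if (head, rel, entity) in observed:
            continue
        negatives.append((head, rel, entity))
        if len(negatives) == n_neg:
            break
    return negatives

# Toy data (invented): 2D embeddings and two observed triples.
toy_embeddings = {
    "aspirin": (0.0, 0.0), "ibuprofen": (0.1, 0.0), "naproxen": (0.3, 0.1),
    "COX1": (2.0, 2.0), "COX2": (2.1, 2.0), "ACE": (2.5, 2.4), "TP53": (5.0, 5.0),
}
toy_observed = {("aspirin", "inhibits", "COX1"), ("aspirin", "inhibits", "COX2")}
toy_negatives = nearest_negatives(("aspirin", "inhibits", "COX1"), toy_embeddings, toy_observed)
```

In this toy case the closest entity, COX2, is skipped because it forms an observed triple, so the next closest entities are used instead.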

We then extended this work to propose an integrated drug repurposing, evaluation and explanation pipeline for COVID-19 disease 11. The workflow starts with collecting and cleaning a COVID-19 centric drug repurposing knowledge graph (DRKG). Then, high-quality and compact ensemble embeddings are learned using three embedding methods. The embeddings are then used to train a deep neural network based model to predict the probability of unobserved triples connecting drugs with 27 COVID-19 proteins (as drug targets). The top-100 predictions are evaluated based on (i) cross-matching with in-trial drugs for COVID-19 and (ii) molecular evaluation based on compound and protein structures. Besides these evaluations, we learn high-quality rules from DRKG and provide possible explanations of predictions. This study demonstrates how complementary embedding methods can be used to generate high-quality ensemble embeddings of a KG and how to use embeddings for the drug repurposing task. To the best of our knowledge, it is the first attempt to combine virtual screening methods with KG embedding methods in predicting and evaluating repurposable drugs for COVID-19. Besides the retrieval of many in-trial drugs, both methods converge on the result that the Fosinopril compound could be a new potential nsp13 inhibitor. Experimental validation of Fosinopril as a COVID-19 treatment is a possible follow-up to this study. The molecular evaluation results and explanations of the predictions make us confident about the drawn conclusions. Md Kamrul Islam successfully defended his PhD on December 16, 2022 44.

7.1.2 Biological network modeling

Boolean Networks (BNs) are a simple formalism used to study complex biological systems when the prediction of exact reaction times is not of interest. BNs play a key role in understanding the dynamics of the studied systems and in predicting their disruption in case of complex human diseases. The BioModels database is a well-known repository of peer-reviewed models represented in the Systems Biology Markup Language (SBML). Most of these models are quantitative, but in some use cases, qualitative models such as BNs are better suited. In the context of Athénaïs Vaginay's PhD project, we proposed SBML2BN, a pipeline dedicated to the automatic transformation of quantitative SBML models to Boolean networks 69. Our approach takes advantage of several SBML elements (reactions, rules, events) as well as a numerical simulation of the concentration of the species over time to constrain both the structure and the dynamics of the Boolean networks to synthesise. Finding all the BNs complying with a given structure and dynamics was formalised as an optimisation problem formulated in the answer-set programming framework.

We ran SBML2BN on more than 200 quantitative SBML models, and we could construct Boolean networks which are compatible with the structure and the dynamics of the SBML models 69. The most recent work relies on abstract simulation of a chemical reaction network (CRN) to avoid the tricky binarization task: we propose to simulate chemical reaction networks with the deterministic semantics abstractly, without any precise knowledge of the initial concentrations. For this, the concentrations of species are abstracted to Booleans stating whether the species is present or absent, and the derivatives of the concentrations are abstracted to signs saying whether the concentration is increasing, decreasing, or unchanged. We use abstract interpretation over the structure of signs for mapping the ODEs of a reaction network to a Boolean network with nondeterministic updates. The abstract state transition graph of such Boolean networks can be computed by finite domain constraint programming over the finite structure of signs. Constraints on the abstraction of the initial concentrations can be added naturally, leading to an abstract simulation algorithm that produces only the part of the abstract state transition graph that is reachable from the abstraction of the initial state. We proved the soundness of our abstract simulation algorithm, and showed its applicability to reaction networks in the SBML format from the BioModels database 14. Athénaïs Vaginay successfully defended her PhD on July 7, 2023 16.
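The idea of abstracting concentrations to Booleans and derivatives to signs can be sketched on a toy reaction network. This is only an illustration under strong simplifying assumptions, not the algorithm of the cited work: it assumes mass-action-like kinetics (a reaction has a positive rate exactly when all its reactants are present), uses plain enumeration instead of constraint programming, and all function names are hypothetical.

```python
from itertools import product

def sign_of_derivative(species, present, reactions):
    """Set of possible signs of d[species]/dt when exactly `present` species are non-zero."""
    plus = minus = False
    for reactants, products in reactions:
        if not all(r in present for r in reactants):
            continue  # assumption: a missing reactant means a zero rate
        net = products.count(species) - reactants.count(species)
        plus = plus or net > 0
        minus = minus or net < 0
    if plus and minus:
        return {-1, 0, 1}  # opposite contributions: the sign of the sum is unknown
    if plus:
        return {1}
    if minus:
        return {-1}
    return {0}

def successors(present, species, reactions):
    """Nondeterministic Boolean update derived from the abstract signs."""
    choices = []
    for s in species:
        signs = sign_of_derivative(s, present, reactions)
        if s not in present:
            # An absent species becomes present only if it can increase.
            choices.append([True] if 1 in signs else [False])
        else:
            # A present species may vanish only if it can decrease.
            choices.append([True, False] if -1 in signs else [True])
    return {
        frozenset(s for s, keep in zip(species, combo) if keep)
        for combo in product(*choices)
    }

def reachable_graph(initial, species, reactions):
    """Part of the abstract state transition graph reachable from `initial`."""
    graph, todo = {}, [frozenset(initial)]
    while todo:
        state = todo.pop()
        if state not in graph:
            graph[state] = successors(state, species, reactions)
            todo.extend(graph[state])
    return graph

# Toy network with a single reaction A + B -> C, starting with A and B present.
toy_graph = reachable_graph({"A", "B"}, ["A", "B", "C"], [(("A", "B"), ("C",))])
```

On this toy network, C appears as soon as A and B are both present and is never removed, while A and B may each persist or be exhausted.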

7.1.3 Machine Learning and Scalable Graph-based Approaches

In the context of a collaboration with researchers from the University Badji Mokhtar-Annaba (Algeria), the University of Quebec at Montreal (UQAM) and the University of Lille, Sabeur Aridhi proposed a genetic algorithm for random forest 12. The proposed algorithm has three main objectives: (1) strengthening the classification accuracy of individual decision trees as well as that of the forest, (2) making use of diversity measures among the decision trees to improve the generalization of the constructed model, and (3) minimizing the number of trees in the forest and finding an optimal subset of the random forest.

7.2 Axis 2: Integrative Multi-Component Assembly and Modeling

Participants: Isaure Chauvot de Beauchêne, Marie-Dominique Devignes, Malika Smaïl-Tabbone, Bernard Maigret, Yasaman Karami, Hamed Khakzad, Dominique Mias-Lucquin, Antoine Moniot, Anna Kravchenko, Hrishikesh Dhondge, Diego Amaya Ramirez.

7.2.1 Modeling and design of RNA-RRM complexes

Our recent H2020 ITN project RNAct (2018-2022) aimed at designing new RNA-binding proteins based on the evolutionary conserved protein domain1, called RNA Recognition Motif (RRM). In this context and in the continuity of this project, we develop approaches to create 3D models of ssRNA-RRM complexes, building up on our expertise in RNA-protein modeling by fragments assembly 30. We use the ATTRACT docking software to sample and evaluate possible (low binding energy) positions of RNA fragments on the protein surface, then assemble the geometrically compatible poses with compatible sequences, to build a continuous model for the full RNA sequence.

Empirical inference of an RNA-protein energy function.

As part of Anna Kravchenko's thesis and in collaboration with Sjoerd de Vries (SISR team, LORIA), we have implemented a new method to create energy parameters for ssRNA-protein interactions in coarse-grained representation. In the ATTRACT procedure, each amino acid of the protein and each nucleotide of the RNA is represented by 2 to 7 pseudo-atoms (“beads”). For each model of an RNA-protein interaction, the energy is computed as the sum of the bead-bead energies, and the model with the lowest energy is considered as the most probable. Each bead-bead energy depends solely on the 2 bead types (among 17 RNA types and 32 protein types) and the inter-bead distance. The models are then ranked by energy, in order to select the top-ranked poses that are supposed to be enriched in correct positions. We previously used parameters created in 2010 and tailored for double-stranded RNA, but their performance is poor on ssRNA. One main goal of Anna Kravchenko’s PhD was to optimize those parameters for a better discrimination of the correct positions for a given ssRNA fragment on an RRM protein domain. In 2022, we set up a novel “histogram-based” approach, called HIPPO (HIstogram-based Pseudo-POtential). From a training set of solved RRM-ssRNA complexes, we derived a set of non-redundant RRM-fragment complexes (here a fragment is a tri-nucleotide of RNA). From each RRM-fragment complex and for each bead-bead type, we build a log-odds histogram of the occurrences of bead-bead distances (discretized into bins) observed in correct/incorrect poses retrieved from the fragment docking run corresponding to this RRM-fragment complex. The set of histograms obtained for a given RRM-fragment complex is called a "scoring potential" and later used to score the poses retrieved from fragment docking runs performed with RRMs and fragments derived from other solved RRM-ssRNA complexes constituting the test set.
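The construction of a log-odds histogram per bead-bead type can be sketched as follows. This is a simplified illustration and not the HIPPO code: the bin width, number of bins, pseudo-count, the sign convention (higher score meaning more similar to correct poses) and the bead type names are assumptions.

```python
import math
from collections import defaultdict

def build_potential(correct, incorrect, bin_width=1.0, n_bins=10, pseudo=1.0):
    """Log-odds histograms per bead pair.

    `correct` and `incorrect` are iterables of (bead_pair, distance)
    observations collected from correct and incorrect docking poses.
    """
    def histogram(observations):
        counts = defaultdict(lambda: [pseudo] * n_bins)  # pseudo-count avoids log(0)
        for pair, distance in observations:
            b = int(distance // bin_width)
            if 0 <= b < n_bins:  # distances beyond the last bin are ignored
                counts[pair][b] += 1
        return counts

    pos, neg = histogram(correct), histogram(incorrect)
    potential = {}
    for pair in set(pos) | set(neg):
        p, q = pos[pair], neg[pair]
        total_p, total_q = sum(p), sum(q)
        # Log ratio of the normalized frequencies in correct vs incorrect poses.
        potential[pair] = [
            math.log((p[b] / total_p) / (q[b] / total_q)) for b in range(n_bins)
        ]
    return potential

def score_pose(potential, contacts, bin_width=1.0):
    """Sum the log-odds values of all bead-bead contacts of one pose."""
    total = 0.0
    for pair, distance in contacts:
        bins = potential.get(pair)
        b = int(distance // bin_width)
        if bins is not None and 0 <= b < len(bins):
            total += bins[b]
    return total

# Toy observations for one invented bead pair.
toy_pair = ("protein_bead_7", "rna_bead_3")
toy_potential = build_potential(
    correct=[(toy_pair, 4.2)] * 8 + [(toy_pair, 8.5)] * 2,
    incorrect=[(toy_pair, 4.5)] * 2 + [(toy_pair, 8.1)] * 8,
)
```

With these toy counts, a contact near 4 Å scores higher than one near 8 Å, because the former is over-represented in correct poses.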

In 2023, we greatly improved this approach by creating a consensus scoring procedure from 4 sets of histograms/scoring potentials that were identified as best covering the diversity of RRM-ssRNA binding modes while avoiding over-fitting. We tested this consensus procedure on a benchmark of RRM-fragment complexes, extracted from 51 experimental structures of RRM-ssRNA complexes. HIPPO achieved a successful enrichment in correct poses (60% of correct poses in the 20% top-ranked poses) for 53% of our RRM-fragment complexes, versus 26% with the old parameters. Most importantly, HIPPO achieved a high enrichment for at least 1 fragment of the full ssRNA for 75% of the RRM-ssRNA complexes, versus 54% with the old parameters. This led us to adapt the assembly of geometrically compatible poses, by retaining fewer top-ranked poses for one of the fragments (iteratively supposed to be the highly enriched one), thus decreasing the complexity of the whole fragment-based docking procedure.
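The enrichment criterion can be made concrete with a short sketch. It rests on one reading of the criterion, namely that at least 60% of all correct poses must appear among the 20% top-ranked poses. The function name and the convention that a higher score ranks first are assumptions.

```python
def recall_in_top(scores, is_correct, top_fraction=0.2):
    """Fraction of all correct poses that fall within the top-ranked fraction."""
    n_correct = sum(is_correct)
    if n_correct == 0:
        return 0.0
    # Assumption: a higher score means a better-ranked pose.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_top = max(1, int(len(scores) * top_fraction))
    return sum(is_correct[i] for i in order[:n_top]) / n_correct

def is_enriched(scores, is_correct, top_fraction=0.2, min_recall=0.6):
    """True when the ranking meets the enrichment criterion as read above."""
    return recall_in_top(scores, is_correct, top_fraction) >= min_recall

# Toy ranking of 10 poses, 3 of which are correct.
toy_scores = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
toy_labels = [True, True, False, False, True, False, False, False, False, False]
toy_recall = recall_in_top(toy_scores, toy_labels)
```

In the toy ranking, two of the three correct poses are in the top two positions, so the recall is 2/3 and the criterion is met.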

To our surprise, we also found that those parameters, trained on ssRNA-RRM structures only, also perform better than the old parameters for ssRNA-protein complexes that do not contain the RRM protein domain. This encourages us to train a general scoring potential for all ssRNA-protein complexes in the near future. We also found that, in most cases, a large majority of the top-ranked correct poses are selected by only one of the 4 sets of histograms. This work was presented at the ISMB-ECCB 2023 international conference in Lyon in July 2023 18 and a full paper is currently under revision.

For 2024, a way to improve HIPPO’s performance would be to predict which of the 4 sets will perform best on a given protein-fragment case. This would avoid retaining the false positives returned by the other three. This may be achieved with supervised machine learning techniques based on the sequence of the fragment, the sequence and/or structure of the protein, and/or the docking poses. Such a pre-trained classifier would not only drastically improve the performance of the scoring but could also give biological insight into the most prevalent protein-ssRNA binding modes.

Search of self-avoiding paths for fragment-based docking.

The assembly of compatible poses in our fragment-based modeling of RNA-protein complexes can be formalized as the search for paths of low global energy in a graph representing the pairwise compatibility of poses (fragment positions on the protein), each pose having a binding energy. The algorithm we used so far to sample paths does not take into account possible steric collisions (or clashes) between non-consecutive fragments. Chains containing such collisions have to be filtered a posteriori. In collaboration with Yann Ponty (LIX) and Fabrice Leclerc (IBC), we tried to avoid such collisions more efficiently, by integrating this global constraint during path sampling. We adapted Noga Alon's historical color-coding algorithm2 to the search for self-avoiding paths of k vertices in a directed graph, in order to circumvent the NP-completeness of this problem. It consists in randomly assigning one of k colors to each vertex, applying an $O(n 2^{k})$ algorithm for finding well-colored paths of size k (1 vertex per color) with minimal global energy, and repeating the method E times, where E is the expected waiting time for a good coloring of a given path ($E = k^{k} / k!$), which can be approximated by $e^{k}$. In this way, we obtain the best chain of poses with a high probability in $O(n (2e)^{k})$ time, compared to a brute-force complexity of $O(n^{k})$.
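The color-coding principle can be sketched with a dynamic program over (color subset, end vertex) pairs. This is a generic textbook-style illustration, not the team's implementation: it enforces only vertex distinctness through colors, does not model steric clashes, and the graph encoding, function names and default number of trials are assumptions.

```python
import math
import random

def best_colorful_path(n, edges, energy, k, colors):
    """Minimum-energy path of k vertices using each of the k colors exactly once.

    `edges` maps a vertex to its successors, `energy[v]` is the energy of
    vertex v, `colors[v]` is in range(k). Returns (energy, path) or None.
    """
    # best[(mask, v)]: cheapest path ending at v whose vertex colors form `mask`.
    best = {(1 << colors[v], v): (energy[v], (v,)) for v in range(n)}
    for _ in range(k - 1):
        extended = {}
        for (mask, u), (e, path) in best.items():
            for w in edges.get(u, ()):
                bit = 1 << colors[w]
                if mask & bit:
                    continue  # color already used, so w cannot extend this path
                key = (mask | bit, w)
                candidate = (e + energy[w], path + (w,))
                if key not in extended or candidate[0] < extended[key][0]:
                    extended[key] = candidate
        best = extended
    full = (1 << k) - 1
    return min((val for (mask, _), val in best.items() if mask == full), default=None)

def color_coding_search(n, edges, energy, k, trials=None, seed=0):
    """Repeat random colorings and keep the best colorful path found."""
    rng = random.Random(seed)
    if trials is None:
        trials = math.ceil(math.e ** k)  # about the expected number of colorings needed
    best = None
    for _ in range(trials):
        colors = [rng.randrange(k) for _ in range(n)]
        found = best_colorful_path(n, edges, energy, k, colors)
        if found is not None and (best is None or found[0] < best[0]):
            best = found
    return best

# Toy compatibility graph of 5 poses with invented energies.
toy_edges = {0: [1, 2], 1: [2, 0], 2: [3], 3: [4]}
toy_energy = [-1.0, -2.0, -0.5, -3.0, -0.1]
toy_fixed = best_colorful_path(5, toy_edges, toy_energy, 3, colors=[0, 0, 1, 2, 0])
```

Since a colorful path cannot visit two vertices of the same color, every path returned is vertex-disjoint by construction. The randomized search finds the optimum only with high probability, which increases with the number of trials.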

To further reduce the size of the problem and improve computation time, we grouped the poses into cliques of clashing poses, which can therefore be systematically assigned the same color. The goal of cliques is to maximize the number of graph edges in a clique, i.e. edges which will never be chosen during path sampling, thus reducing the total number of possible paths. This work was recently accepted for the international conference RECOMB 2024, to be held April 29 - May 2, 2024 in Cambridge MA (USA).

Cross-mapping of protein domain structural instances.

To address the issue of inconsistencies of protein domain classification in different domain databases, we developed a tool named Cross-Mapper for domain Structural instances (CroMaSt). This tool is a workflow that uses database querying and structural alignment to assign a confidence level to each 3D structure of a given domain in relation to its possible membership of a given domain type. This workflow was primarily developed for RRMs, but can easily be adapted to any structural domain. The workflow has been formalized using the Common Workflow Language (CWL) 3. Its rationale is based on the cross-mapping of PDB (Protein Data Bank) entries retrieved as RRM, in the Pfam and CATH domain classifications. Entries that are mapped in both classifications are considered as the “core” RRMs. Those that are specific to only one classification are further analyzed using structural alignment, in order to determine whether they are RRM-like or false RRMs 8. The workflow can be used to create datasets for specific domains with a good confidence level, which are useful for characterizing domain structural diversity or for further analyses such as machine learning, evolutionary studies or synthetic biology. This work was an important contribution of the doctoral thesis of Hrishikesh Dhondge who successfully defended it on 11 July 2023 15.
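The cross-mapping step can be illustrated with plain set operations. This sketch covers only the first split between instances found in both classifications and instances specific to one of them. The subsequent structural alignment is not shown, and the function name and instance identifiers are hypothetical.

```python
def cross_map(pfam_instances, cath_instances):
    """Split domain structural instances by their presence in each classification."""
    pfam, cath = set(pfam_instances), set(cath_instances)
    return {
        "core": pfam & cath,       # mapped in both classifications
        "pfam_only": pfam - cath,  # to be checked further by structural alignment
        "cath_only": cath - pfam,  # to be checked further by structural alignment
    }

# Invented identifiers of the form (PDB entry, chain).
toy_mapping = cross_map(
    pfam_instances=[("1abc", "A"), ("2xyz", "B"), ("3def", "A")],
    cath_instances=[("1abc", "A"), ("3def", "A"), ("4ghi", "C")],
)
```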

7.2.2 3D Modeling of proteins and protein complexes

Modeling protein-DNA complexes to fight antibiotic resistance.

In the context of the FEDER-CITRAM project (collaboration with the DynAMic and LPCT labs, Université de Lorraine), we study a new type of relaxase (RelSt3) as a key protein for horizontal DNA transfer, one of the processes responsible for the spread of antibiotic resistance in bacteria. The relaxase cuts parts of the bacterial DNA genome, then brings the cut DNA piece to the “coupling protein” inserted in the bacterial membrane, which transfers this piece to the recipient cell. In 2022, we contributed to the exploration of DNA processing features of RelSt3 50. In 2023, we extended this work to the 3D modeling of different types of relaxases and coupling proteins in different bacterial strains, using AlphaFold and including the evaluation of the most reliable parts of each model. This paves the way for using those models to study the relaxase-DNA and relaxase-coupling protein interactions, in order to target them with inhibitors.

In parallel, a new round of virtual screening of inhibitors has been carried out, targeting the ssDNA binding pocket of RelSt3. The inhibitory activity of the selected compounds is currently being tested in the DynAMic lab.

Structural basis of donor-specific antibody response in graft rejection.

In the context of the Inria-Inserm PhD project of Diego Amaya Ramirez and in collaboration with the Immunology and Histocompatibility Laboratory at the APHP Saint-Louis Hospital in Paris, we study the structural differences between donor and recipient HLA proteins to understand and possibly predict the immune response triggered in the recipient, which can result in rejection of the transplanted organ. A dataset of 207 HLA 3D structures has been created, both from PDB and AlphaFold predictions. Molecular dynamics runs (10 ns) have been analyzed at the level of single residues or surface patches centered on surface residues. Surface patches are described by a set of 18 descriptors, including both static and dynamic properties, such as hydrophobicity, electrostatic charges, relative solvent-accessible surface area and side-chain flexibility. A machine-learning predictor for B-cell epitopes on HLA proteins (HLA-EpiCheck) has been trained using an Extra Trees ensemble learning method to discriminate between positive (confirmed epitopes) and negative (non-epitope) surface patches. HLA-EpiCheck outperformed DiscoTope-3.0, a state-of-the-art B-cell epitope predictor, in the task of predicting HLA epitopes. HLA-EpiCheck was also used to assess the epitope status of a subset of non-confirmed eplets. The predictions were compared to experimental data and a notable consistency was found. These results suggest that HLA-EpiCheck could be used to better define HLA matching between donor and recipient in order to reduce de novo DSA formation and graft rejection. This work was presented as a poster at the ISMB-ECCB 2023 international conference in July 2023 17. A full paper was submitted for publication in December 2023 (preprint available on bioRxiv).

7.2.3 Investigating the dynamical behaviour of protein complexes with MD simulation

Analysis of COVID-19 spike binding to ACE2.

In collaboration with Laurent Chaloin at the IRIM (Institut de Recherche en Infectiologie de Montpellier, CNRS-Université de Montpellier), we investigated the interaction between the RBD domain of SARS-CoV-2 Spike protein and human ACE2 over very long molecular dynamics (MD) simulations (1.5 µs, performed on the Jean Zay supercomputer at GENCI, IDRIS-CNRS in Orsay). A panel of methods was used to analyse the MD trajectories. The goal of this project is to compare the Omicron variants with the Delta variant of SARS-CoV-2. The manuscript is under revision.

Nucleosomes and post-translational modifications.

Nucleosomes are made of proteins and DNA. Yasaman Karami has started a new collaboration with Emmanuelle Bignon, a recently recruited CNRS researcher at LPCT (Laboratoire de Physique et Chimie Théorique). Together, they study the effect of post-translational modifications of proteins on the nucleosome4, using MD simulations and COMMA analysis 48. The results of their work are now under consideration at the Computational and Structural Biotechnology Journal following a favorable revision (preprint available on bioRxiv).

M1-IgG interaction network and dynamics.

In collaboration with Pontus Nordenfelt (Lund University), Yasaman Karami and Hamed Khakzad investigated the role of hinge flexibility for the opsonic function of antibodies5. An opsonic IgG1 monoclonal antibody, Ab25 25, targeting the M protein of the bacterium Streptococcus pyogenes (S. pyogenes), was engineered into the IgG2-4 subclasses. Despite reduced binding, the IgG3 version demonstrated enhanced opsonic function. MD simulations showed that the IgG3 Fc region exhibits extensive mobility in 3D space relative to the antigen due to its extended hinge region. The MD simulations also showed altered Fab-antigen interactions, in line with the diminished affinity of IgG3. We explored the impact of hinge-engineering by generating a panel of IgG antibodies, IgGh, containing the CH1-3 domains of IgG1 and different segments of the IgG3 hinge. Hinge-engineering enhanced opsonic function, with the most potent hinge having 47 amino acids. IgGh47 far exceeded the parent IgG1 and even the IgG3 version. The IgGh47 was protective against S. pyogenes in a systemic infection mouse model, contrary to parent IgG3 and IgG1. The in vitro phenotype of IgGh47 was generalizable to clinical isolates with different M protein types. Finally, we generated IgGh47 versions of anti-SARS-CoV-2 mAbs, which exhibited strongly enhanced in vitro opsonic function compared to the original IgG1. The improved function of the IgGh47 subclass in two distant biological systems provides new insights into antibody function and how to enhance it for opsonic function. The paper received a favorable revision decision from Nature Communications (preprint available on bioRxiv 45).

Unraveling the complexity of glycosaminoglycan deficiencies caused by B3GALT6 loss-of-function mutations: a multifaceted approach

Deleterious variants in beta-1,3-galactosyltransferase (B3GALT6) compromise the early steps of glycosaminoglycan (GAG) synthesis, causing a spondylodysplastic subvariety of Ehlers-Danlos syndromes (spEDS). Unfortunately, the mechanisms by which pathogenic mutations impair B3GALT6 function and translate into a severe, often lethal, connective tissue disorder are poorly defined. In collaboration with Sylvie Fournel-Gigleux at IMoPA, Nancy, Yasaman Karami and Hamed Khakzad dissected the patho-mechanisms of spEDS by a multi-tiered approach, from exploring the protein structure alterations by MD simulations to atomic force microscopy-based extracellular matrix (ECM) investigations. We found that pathogenic variants produced a unique structural B3GalT6 alteration resulting in loss of function that was partially rescued by glucuronosyltransferase 1 (GlcAT-1), the next enzyme in the pathway. Transcriptomics revealed that the resulting GAG defects predominantly dysregulated collagen maturation. In line with this, by leveraging a B3galT6-invalidated ECM model recapitulating the condition, we suggest that the absence of collagen XII glycosylation contributes to altered tissue structure and biomechanics. Our findings uncover a novel link between GAG and collagen defects and shed light on the pathobiology of spEDS. The manuscript has recently been submitted to the Journal of Clinical Investigation. In continuity with this work, the team is now involved in the ANR GlycoLink project (2023-2027) coordinated by Sylvie Fournel-Gigleux.

7.2.4 Machine learning methods for proteomics, interactomics and protein design

Deep Learning and de novo MS-based peptide sequencing.

In recent years, the application of deep learning has represented a breakthrough in the mass spectrometry (MS) field by improving the assignment of the correct sequence of amino acids from observable MS spectra without prior knowledge, also known as de novo MS-based peptide sequencing. However, like other modern neural networks, these models do not generalize well enough, as they perform poorly on peptide test sets with highly varying N- and C-termini. To mitigate this generalization problem, Hamed Khakzad and colleagues from Lund University (Sweden) conducted a systematic investigation to unravel the requirements for building generalizable models and boosting the performance on the MS-based de novo peptide sequencing task. Several experiments confirmed that the peptide diversity of the training set directly impacts the resulting generalizability of the model. Data showed that the best models were the multienzyme models (MEMs), i.e., models trained from a compendium of highly diverse peptides, such as the one generated by digesting samples from a wide variety of species with a group of proteases. The applicability of these MEMs was later established by fully de novo sequencing eight of the ten polypeptide chains of five commercial antibodies and extracting over 10,000 proving peptides10.

Machine learning and protein design.

In collaboration with the team of Pierre Tuffery (Université Paris-Cité), Yasaman Karami developed a method to design cyclic peptides and propose candidate linkers. Large-scale data-mining of available protein structures was previously shown to be useful for the precise identification of protein loop conformations, even from remote structural classes 47. This approach was transposed to linkers, allowing head-to-tail peptide cyclization. This project has strong potential for cyclic peptide-based drug design and was published in 2023 in the Journal of Chemical Information and Modeling 49.

From a more general point of view, Hamed Khakzad wrote, in collaboration with Bruno Correia's team at Ecole Polytechnique Fédérale de Lausanne and Michael Bronstein from the University of Oxford, a review of recent developments and technologies in deep learning methods, with examples of their performance in generating novel functional proteins 13.

7.2.5 Miscellaneous results on structural studies of host-pathogen interactions

Mutational analysis of vinculin and cell adhesion

Vinculin is a cytoskeletal linker strengthening cell adhesion. The Shigella IpaA invasion effector binds to vinculin to promote vinculin supra-activation associated with head-domain mediated oligomerization. In collaboration with Pr Guy Tran van Nhieu (I2BC, Université Paris-Saclay), Hamed Khakzad has investigated the impact of mutations within the vinculin D1D2 subdomains, which are predicted to interact with IpaA VBS3. These mutations influence the rate of D1D2 trimer formation, with distinct effects on monomer disappearance, consistent with structural modeling of a “closed” and “open” D1D2 conformer induced by IpaA. Notably, mutations targeting the closed D1D2 conformer significantly reduced Shigella invasion of host cells, in contrast to a mutation targeting a putative D2 coiled-coil motif or a cysteine clamp affecting later stages of vinculin head-domain oligomerization. All mutations affected the formation of focal adhesions (FAs). Our findings suggest that IpaA-induced vinculin supra-activation primarily reinforces matrix adhesion in infected cells, rather than promoting bacterial invasion. Consistently, shear stress studies pointed to a key role for IpaA-induced vinculin supra-activation in accelerating and strengthening cell matrix adhesion. Additionally, our results support the involvement of vinculin supra-activation in FA maturation and cell adhesion. The manuscript received favorable revisions from Life Science Alliance (preprint available in bioRxiv 32). This work led to the ANR grant application TRIVIAL, which is under evaluation for AAPG2024.

Functional proteomics to reveal streptolysin O as a novel plasminogen-binding streptococcal protein

The bacterium S. pyogenes is a highly adaptive, human-specific pathogen armed with multifunctional bacterial proteins and known to exploit the plasminogen (PLG)-plasmin (PLM) system to promote its dissemination and survival in the human host. In collaboration with Pr Johan Malmström (Lund University, Sweden), Hamed Khakzad applied a series of functional proteomics methods to chart the protein-protein interaction landscape centred on streptolysin O (SLO), revealing that this critical cytolytic toxin binds specifically to PLG. Binding of SLO to PLG potentially alters the shape of PLG from a compact to a partially relaxed conformation, thereby accelerating PLM production via tissue-type plasminogen activator. Our results reveal a conserved moonlighting pathomechanistic role for SLO, which is carried by all S. pyogenes isolates, extending beyond its established cytolytic activity. A deeper investigation is warranted to better understand the relationships between bacterial adaptation and host haemostasis during different stages of infection. A manuscript describing this work is in preparation.

8 Partnerships and cooperations

8.1 International initiatives

8.1.1 Participation in other International Programs

NewDAFI

Participants: Malika Smaïl-Tabbone, Sabeur Aridhi, Marie-Dominique Devignes, Bernard Maigret.

Funding:
COFECUB - CAPES 2023
Title:
New drugs against invasive fungal infections: from hits to optimised leads through machine learning
Partner Institution(s):
The Catholic University of Brasília (UCB), Brazil
Date/Duration:
2023-2026
Additional info/keywords:
The main objective of this project is to convert the knowledge previously acquired from comparative genomics, selection of new therapeutic targets and identification of a new class of antifungals into a product that can reach the preclinical phase and that effectively contributes to the fight against emerging fungal diseases and nosocomial infections. Our ambition is also to perpetuate the international scientific exchange between Brazil and France for training qualified human resources, especially in interdisciplinary areas.

8.2 International research visitors

8.2.1 Visits of international scientists

Other international visits to the team

Pr Edward Egelman

Status:
Professor
Institution of origin:
University of Virginia
Country:
USA
Dates:
11-13 July 2023
Context of the visit:
Ongoing collaboration with Yasaman Karami and 1st Nancy Computational Structural Biology (NCSB) day
Mobility program/type of mobility:
Invitation for a lecture and scientific discussions

Pr Maria Sueli Felipe

Status:
Professor
Institution of origin:
Catholic University of Brasilia
Country:
Brazil
Dates:
22-27 October 2023
Context of the visit:
Exchange of senior scientists in the frame of a COFECUB-CAPES program (2023-2026) coordinated by Malika Smaïl-Tabbone.
Mobility program/type of mobility:
Lecture and scientific discussions

8.2.2 Visits to international teams

Research stays abroad

Marie-Dominique Devignes

Visited institution:
Catholic University of Brasilia and State University of Maringa
Country:
Brazil
Dates:
20-30 November 2023
Context of the visit:
Exchange of senior scientists in the frame of a COFECUB-CAPES program (2023-2026) coordinated by Malika Smaïl-Tabbone.
Mobility program/type of mobility:
Lectures and Master Classes

Sabeur Aridhi

Visited institution:
University of Trento, Pr Alberto Montresor.
Country:
Italy
Dates:
21-26 May 2023
Context of the visit:
Inria delegation and short fellowship support (LORIA).
Mobility program/type of mobility:
Scientific discussions

8.3 European initiatives

8.3.1 Other European programs/initiatives

Through her involvement in the French Institute of Bioinformatics (joint coordination of the Open Science and Interoperability taskforce), Marie-Dominique Devignes is a member of the European ELIXIR Interoperability platform.

8.4 National initiatives

ANR EPIHLA

Participants: Marie-Dominique Devignes [contact person], Malika Smaïl-Tabbone, Diego Amaya Ramirez, Bernard Maigret.

Title:
HLA compatibility in organ transplantation: from antigens to epitopes (EPIHLA)
Duration:
October 2022-October 2025
Coordinator:
Pr. Jean-Luc Taupin (Inserm U976, Saint-Louis Hospital, Paris)
Inria contact:
Marie-Dominique Devignes
Partner Institutions:
- Inserm U976 IRSL Saint-Louis Hospital (Paris)
- LORIA CNRS (Nancy)
- INSERM U1016 Cochin Institute (Paris)
- CNRS U144 Institut Curie (Paris)
Summary:
The EPIHLA project has two major aims. (1) It aims at correctly representing the 3D structure of HLA molecules and superimposing predicted conformations in order to identify 3D differences that could constitute epitopes and eplets, the targets of donor-specific antibodies. (2) It aims at developing the capacity to isolate and clone anti-HLA antibody genes from patients’ B lymphocytes. The results will provide decisive new information for understanding humoral alloreactivity and will make it possible to better anticipate transplant rejection. This project was initially based on the Inria-Inserm PhD project of Diego Amaya Ramirez (2019-2022). This thesis ("HLA genetic system and organ transplantation: understanding the basics of immunogenicity to improve donor - receptor compatibility when assigning grafts to recipients") is not yet finished and is still co-supervised by Marie-Dominique Devignes and Pr. Jean-Luc Taupin.
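As a rough illustration of the structural comparison described in aim (1), the sketch below flags residues whose position differs between two conformations. It is only a toy example under explicit assumptions, not the EPIHLA method: the two conformations are assumed to be already superimposed, each residue is reduced to a single (x, y, z) point in ångströms, and the function names, residue numbers and the 2.0 Å threshold are hypothetical.

```python
import math


def per_residue_deviation(conf_a, conf_b):
    """Distance between matching residues of two superimposed conformations.

    Each conformation maps a residue number to one (x, y, z) coordinate.
    """
    if conf_a.keys() != conf_b.keys():
        raise ValueError("both conformations must cover the same residues")
    return {res: math.dist(conf_a[res], conf_b[res]) for res in conf_a}


def divergent_residues(conf_a, conf_b, threshold=2.0):
    """Residues displaced by more than `threshold` (hypothetical cutoff)."""
    deviations = per_residue_deviation(conf_a, conf_b)
    return sorted(res for res, d in deviations.items() if d > threshold)


if __name__ == "__main__":
    # Hypothetical coordinates for three residues in two conformations.
    conformation_1 = {62: (0.0, 0.0, 0.0), 63: (1.0, 0.0, 0.0), 65: (2.0, 0.0, 0.0)}
    conformation_2 = {62: (0.0, 0.0, 0.5), 63: (1.0, 3.0, 4.0), 65: (2.0, 0.0, 0.0)}
    print(divergent_residues(conformation_1, conformation_2))
```

In practice the superposition step itself, the choice of atoms and the cutoff would all need to be defined with care; the sketch only shows the comparison that follows superposition.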

ANR GlycoLink

Participants: Hamed Khakzad [contact person], Yasaman Karami, Isaure Chauvot de Beauchêne.

Title:
Exploring Glycosyltransferases complexes involved in glycosaminoglycan-Linker assembly
Duration:
October 2023-October 2026
Coordinator:
Pr. Sylvie Fournel-Gigleux (UMR 7365 CNRS-Université de Lorraine)
Inria contact:
Hamed Khakzad
Partner Institutions:
- UMR 7365 CNRS-Université de Lorraine (Nancy)
- IBS - UMR 5075 (Grenoble)
- Inria Centre at Université de Lorraine (Nancy)
- ICOA, UMR 7311 CNRS U-Orléans (Orléans)
Summary:
The supramolecular arrangement of glycosyltransferases (GT) and accessory enzymes (kinase, phosphatase, sulfotransferases) governs the repertoire of carbohydrates and their multiple biological functions. The assembly of glycosaminoglycans (GAG), a major class of linear glycopolymers, is initiated by a unique tetrasaccharide linker (GlcAβ1-3Galβ1-3Galβ1-4Xyl) synthesized by the coordinated action of five GT. It serves as a unique primer for the elongation of GAG chains. A recent paradigm puts forward a multimolecular complex called “GAGosome”, formed of GT and auxiliary enzymes, as a major mechanism to guarantee the fidelity and efficiency of GAG synthesis. Indeed, association of the heparan-sulfate (HS) polymerases EXT1/EXT2 and of chondroitin-sulfate (CS) synthases (CSS) sustains GAG elongation. Recent evidence also supports complex formation between GAG-linker enzymes. GlycoLink will explore the formation and organization of multimolecular complexes between GT and partners involved in the assembly of the GAG-linker region, and their functional impact in rare genetic diseases.

8.5 Regional initiatives

LUE-FEDER CITRAM (2017-2023)

The CITRAM project (Conception d'Inhibiteurs de la Transmission de Résistances AntI-Microbiennes), co-funded by Lorraine Université d'Excellence (LUE) and FEDER, was extended until June 2023.

Partners other than CAPSID are:

DynAMic lab (Genome dynamics and microbial adaptation, INRAE-Université de Lorraine UMR 1128), Team of Nathalie Leblond, coordinator;
LPCT (Laboratoire de Physique et Chimie Théoriques, CNRS-Université de Lorraine UMR 7019), Team of Chris Chipot.

MolAI4Cryo (2023-2025)

The MolAI4Cryo project (Modeling and Artificial Intelligence applied to Cryo-EM 3D structures to fight COVID) has obtained a grant from the Région Grand-Est to equip the IMoPA laboratory with cutting-edge equipment for cryo-electron microscopy. The CAPSID team will be involved in proposing algorithms that use cryo-EM results as constraints for modeling protein-RNA complexes.

Partners other than CAPSID are:

IMoPA (Ingénierie Moléculaire et Physiopathologie Articulaire, CNRS-Université de Lorraine UMR 7365), Team of Xavier Manival, coordinator;
DynAMic lab (Genome dynamics and microbial adaptation, INRAE-Université de Lorraine UMR 1128), Team of Nathalie Leblond and Nicolas Soler;
LPCT (Laboratoire de Physique et Chimie Théoriques, CNRS-Université de Lorraine UMR 7019), Team of Chris Chipot and François Dehez.

9 Dissemination

Participants: All Team Members.

9.1 Promoting scientific activities

9.1.1 Scientific events: organisation

Yasaman Karami organized a seminar called NCSB (Nancy Computational Structural Biology) in Nancy on 12 July 2023, with help from both Inria and LORIA. The event was held at the Université de Lorraine and gathered about 50 participants. Professor Edward Egelman was the keynote speaker. In total, 7 oral communications were given by both regional (IMoPA, DynAMic, LPCT, Inria-LORIA) and international (Germany, USA) scientists.

Member of the organizing committees

Isaure Chauvot de Beauchêne was a member of the organizing committee of the AlgoSB Winter school 2023: "From structure resolution to dynamical modeling in cryo-electron microscopy".

Member of the conference program committees

Malika Smaïl-Tabbone was a member of the program committees of the DATA ANALYTICS 2023 international conference and of the EGC 2023 francophone conference.

Sabeur Aridhi was a member of the program committees of the IEEE Big Data 2023 international conference and of the ICML (International Conference on Machine Learning) Workshop on Computational Biology (WCB) 2023. Isaure Chauvot de Beauchêne was a member of the program committee of the ISMB-ECCB 2023 international conference (Lyon, 23-27 July 2023).

Member of the editorial boards

Malika Smaïl-Tabbone is a member of the Scientific Reports Editorial Board (from December 2023).

Marie-Dominique Devignes is a guest editor of Bioinformatics Advances.

Reviewer - reviewing activities

Marie-Dominique Devignes reviewed articles for the Computational and Structural Biotechnology Journal.

Yasaman Karami reviewed articles for Nucleic Acids Research (NAR) and the Computational and Structural Biotechnology Journal.

9.1.2 Invited talks

Marie-Dominique Devignes gave an invited talk on "Open Science at the French Institute of Bioinformatics" during the Love Data Week organized at the Université de Lorraine, 13-16 March 2023.

Yasaman Karami gave an invited talk on “Conformational dynamics of Type-IV Pilus” during the Statistical Physics and Low Dimensional systems (SPLDS) Conference organized by the LPCT lab at Pont-à-Mousson, 24-26 May 2023.

9.1.3 Leadership within the scientific community

Marie-Dominique Devignes is co-leader of the Open Science and Interoperability Task at the French Institute of Bioinformatics (with Alban Gaignard - Institut du Thorax, Nantes, Platform BiRD, and Frederic de Lamotte - INRAE, Department of Biology and Plant Improvement, Montpellier). In this capacity, she is a member of the ELIXIR European Interoperability platform and co-authored a paper on automatic FAIR assessment of web resource data 9.

Hamed Khakzad is an active member of the Rosetta Commons scientific community.

9.1.4 Scientific expertise

Sabeur Aridhi reviewed an ANR project in 2023.

9.1.5 Research administration

Malika Smaïl-Tabbone is a member of the Scientific Council of the Université de Lorraine. As such, she carries out various scientific assessments on behalf of the university.

9.2 Teaching - Supervision - Juries

9.2.1 Teaching

Malika Smaïl-Tabbone is an associate professor at the Université de Lorraine with a full teaching load. She is co-responsible with Pascal Moyal for the IMSD track ("Ingénierie Mathématique pour la Science des Données") in the Applied Mathematics Master's degree at the Université de Lorraine. She is also a member of the teaching team of the CMI BSE ("Cursus Master Ingénieur Biologie-Santé-Environnement").
Sabeur Aridhi is an associate professor at the Université de Lorraine with a full teaching load. He is responsible for the IAMD major ("Ingénierie et Applications des Masses de Données") at TELECOM Nancy.
Marie-Dominique Devignes teaches every year 10 to 16h in the CMI BSE.
Diego Amaya Ramirez held an ATER position from January to August 2023.
Yasaman Karami taught 27 hours on machine learning with Python as part of the Data Scientist continuing education course at the Institute for Digital Management and Cognition (IDMC), Nancy.
Hamed Khakzad taught 16 hours on deep learning methods as part of the Advanced AI course (3A) at TELECOM Nancy.

9.2.2 Supervision

In 2023, a total of 7 PhD students were supervised by CAPSID members. Three of them successfully defended their theses, and three were recruited and started their PhD in October 2023. Hamed Khakzad is also co-supervising, with Rebekka Wild, a PhD thesis that started in October 2023 at the University of Grenoble.

9.2.3 Juries

Malika Smaïl-Tabbone was a reviewer of two PhD theses.

Marie-Dominique Devignes was a reviewer of one PhD thesis and an examiner of two PhD theses.

(Not counting the participation of team members in the juries of CAPSID PhD students)

9.2.4 Internal or external Inria responsibilities

Yasaman Karami is a member of the Inria Comité de Développement Technologique (CDT). The main task of the committee is to evaluate applications for, and the recruitment of, research engineers.

10 Scientific production

10.1 Major publications

1 Seyed Ziaeddin Alborzi, Amina Ahmed Nacer, Hiba Najjar, David W Ritchie and Marie-Dominique Devignes. PPIDomainMiner: Inferring domain-domain interactions from multiple sources of protein-protein interactions. PLoS Computational Biology 17(8), 2021, e1008844.
2 Emmanuel Bresso, Joao-Pedro Ferreira, Nicolas Girerd, Masatake Kobayashi, Grégoire Preud’homme, Patrick Rossignol, Fayez Zannad, Marie-Dominique Devignes and Malika Smaïl-Tabbone. Inductive database to support iterative data mining: Application to biomarker analysis on patient data in the Fight-HF project. Journal of Biomedical Informatics 135, November 2022, 104212.
3 Md Kamrul Islam, Diego Amaya-Ramirez, Bernard Maigret, Marie-Dominique Devignes, Sabeur Aridhi and Malika Smaïl-Tabbone. Molecular-evaluated and explainable drug repurposing for COVID-19 using ensemble knowledge graph embedding. Scientific Reports 13(1), March 2023, 3643.
4 Antoine Moniot, Yann Guermeur, Sjoerd Jacob de Vries and Isaure Chauvot de Beauchêne. ProtNAff: Protein-bound Nucleic Acid filters and fragment libraries. Bioinformatics 38(16), 2022, 3911-3917.
5 David W. Ritchie and Sergei Grudinin. Spherical polar Fourier assembly of protein complexes with arbitrary point group symmetry. Journal of Applied Crystallography 49(1), February 2016, 158-167.
6 Maria Elisa Ruiz Echartea, Isaure Chauvot de Beauchêne and David Ritchie. EROS-DOCK: Protein-Protein Docking Using Exhaustive Branch-and-Bound Rotational Search. Bioinformatics 35(23), 2019, 5003-5010.

10.2 Publications of the year

International journals

7 Marie-Dominique Devignes, Malika Smaïl-Tabbone, Hrishikesh Dhondge, Roswitha Dolcemascolo, Jose Gavaldá-García, R. Anahí Higuera-Rodriguez, Anna Kravchenko, Joel Roca Martínez, Niki Messini, Anna Pérez-Ràfols, Guillermo Pérez Ropero, Luca Sperotto, Isaure Chauvot de Beauchêne and Wim Vranken. Experiences with a training DSW knowledge model for early-stage researchers. Open Research Europe 3, 2023, 97.
8 Hrishikesh Dhondge, Isaure Chauvot de Beauchêne and Marie-Dominique Devignes. CroMaSt: a workflow for assessing protein domain classification by cross-mapping of structural instances between domain databases and structural alignment. Bioinformatics Advances 3(1), January 2023.
9 Alban Gaignard, Thomas Rosnet, Frédéric de Lamotte, Vincent Lefort and Marie-Dominique Devignes. FAIR-Checker: supporting digital resource findability and reuse with Knowledge Graphs and Semantic Web standards. Journal of Biomedical Semantics 14(1), July 2023, 7.
10 Carlos Gueto-Tettay, Di Tang, Lotta Happonen, Moritz Heusel, Hamed Khakzad, Johan Malmström and Lars Malmström. Multienzyme deep learning models improve peptide de novo sequencing by mass spectrometry proteomics. PLoS Computational Biology 19(1), January 2023, e1010457.
11 Md Kamrul Islam, Diego Amaya-Ramirez, Bernard Maigret, Marie-Dominique Devignes, Sabeur Aridhi and Malika Smaïl-Tabbone. Molecular-evaluated and explainable drug repurposing for COVID-19 using ensemble knowledge graph embedding. Scientific Reports 13(1), March 2023, 3643.
12 Nour El Islem Karabadji, Abdelaziz Amara Korba, Ali Assi, Hassina Seridi, Sabeur Aridhi and Wajdi Dhifli. Accuracy and diversity-aware multi-objective approach for random forest construction. Expert Systems with Applications 225(1), April 2023, 120138.
13 Hamed Khakzad, Ilia Igashov, Arne Schneuing, Casper Goverde, Michael Bronstein and Bruno Correia. A new age in protein design empowered by deep learning. Cell Systems 14(11), November 2023, 925-939.

International peer-reviewed conferences

14 Joachim Niehren, Cédric Lhoussaine and Athénaïs Vaginay. Core SBML and its Formal Semantics. CMSB 2023 - 21st International Conference on Formal Methods in Systems Biology, Luxembourg, Luxembourg, September 2023.

Doctoral dissertations and habilitation theses

15 Hrishikesh Dhondge. Structural characterization of RNA binding to RNA recognition motif (RRM) domains using data integration, 3D modeling and molecular dynamic simulation. PhD thesis, Université de Lorraine, July 2023.
16 Athénaïs Vaginay. Synthesis of Boolean networks from the structure and dynamics of reaction networks. PhD thesis, Université de Lorraine, July 2023.

Other scientific publications

17 Diego Amaya-Ramirez, Romain Lhotte, Cedric Usureau, Magali Devriese, Malika Smaïl-Tabbone, Jean-Luc Taupin and Marie-Dominique Devignes. HLA-EpiCheck: A B-cell epitope prediction tool on HLA antigens using molecular dynamics simulation data. ISMB-ECCB 2023 - Intelligent System in Molecular Biology and European Conference on Computational Biology merged event, Lyon, France, July 2023.
18 Anna Kravchenko, Sjoerd Jacob de Vries, Malika Smaïl-Tabbone and Isaure Chauvot de Beauchêne. HIPPO: HIstogram-based Pseudo-POtential for scoring ssRNA-protein fragment-based docking poses. The 31st Annual Intelligent Systems For Molecular Biology (ISMB) and the 22nd Annual European Conference on Computational Biology (ECCB), Lyon, France, July 2023.

10.3 Cited publications

19 C. Alfeghaly, I. Behm-Ansmant and S. Maenner. Study of Genome-Wide Occupancy of Long Non-Coding RNAs Using Chromatin Isolation by RNA Purification (ChIRP). Methods Mol Biol 2300, 2021, 107-117.
20 C. E. Alvarez-Martinez and P. J. Christie. Biological diversity of prokaryotic type IV secretion systems. Microbiology and Molecular Biology Reviews 73, 2011, 775-808.
21 A. Andreeva, D. Howorth, C. Chothia, E. Kulesha and A. G. Murzin. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 42(Database issue), Jan 2014, D310-314.
22 A. Andreeva, E. Kulesha, J. Gough and A. G. Murzin. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res 48(D1), January 2020, D376-D382.
23 M. Baaden and S. R. Marrink. Coarse-grained modelling of protein-protein interactions. Current Opinion in Structural Biology 23, 2013, 878-886.
24 M. Baek, F. DiMaio, I. Anishchenko, J. Dauparas, S. Ovchinnikov, G. R. Lee, J. Wang, Q. Cong, L. N. Kinch, R. D. Schaeffer, C. Millán, H. Park, C. Adams, C. R. Glassman, A. DeGiovanni, J. H. Pereira, A. V. Rodrigues, A. A. van Dijk, A. C. Ebrecht, D. J. Opperman, T. Sagmeister, C. Buhlheller, T. Pavkov-Keller, M. K. Rathinaswamy, U. Dalwadi, C. K. Yip, J. E. Burke, K. C. Garcia, N. V. Grishin, P. D. Adams, R. J. Read and D. Baker. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373(6557), August 2021, 871-876.
25 Wael Bahnan, Lotta Happonen, Hamed Khakzad, Vibha Kumra Ahnlide, Therese de Neergaard, Sebastian Wrighton, Oscar André, Eleni Bratanis, Di Tang, Thomas Hellmark, Lars Björck, Oonagh Shannon, Lars Malmström, Johan Malmström and Pontus Nordenfelt. A human monoclonal antibody bivalently binding two different epitopes in streptococcal M protein mediates immune function. EMBO Molecular Medicine 15(2), 2022, e16208. URL: https://www.embopress.org/doi/abs/10.15252/emmm.202216208
26 M. Blum, H. Y. Chang, S. Chuguransky, T. Grego, S. Kandasaamy, A. Mitchell, G. Nuka, T. Paysan-Lafosse, M. Qureshi, S. Raj, L. Richardson, G. A. Salazar, L. Williams, P. Bork, A. Bridge, J. Gough, D. H. Haft, I. Letunic, A. Marchler-Bauer, H. Mi, D. A. Natale, M. Necci, C. A. Orengo, A. P. Pandurangan, C. Rivoire, C. J. A. Sigrist, I. Sillitoe, N. Thanki, P. D. Thomas, S. C. E. Tosatto, C. H. Wu, A. Bateman and R. D. Finn. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res 49(D1), January 2021, D344-D354.
27 Anaïs Bonneau and Caroline Monchaud. La transplantation d’organes en France. Actualités Pharmaceutiques 60(605), 2021, 18-20. URL: https://www.sciencedirect.com/science/article/pii/S0515370021000562
28 D. F. Burke, P. Bryant, I. Barrio-Hernandez, D. Memon, G. Pozzati, A. Shenoy, W. Zhu, A. S. Dunham, P. Albanese, A. Keller, R. A. Scheltema, J. E. Bruce, A. Leitner, P. Kundrotas, P. Beltrao and A. Elofsson. Towards a structurally resolved human protein interaction network. Nat Struct Mol Biol 30(2), Feb 2023, 216-225.
29 Isaure J Chauvot De Beauchene, Sjoerd J De Vries and Martin J Zacharias. Fragment-based modeling of protein-bound ssRNA. Poster, September 2016.
30 Isaure Chauvot de Beauchêne, Sjoerd Jacob De Vries and Martin Zacharias. Fragment-based modelling of single stranded RNA bound to RNA recognition motif containing proteins. Nucleic Acids Research, June 2016.
31 A. Clery and F. Allain. From Structure to Function of RNA Binding domains. In: RNA binding proteins, Lorkovic Zdravko (ed.), Landes Bioscience and Springer Science+Business Media, 2011.
32 Benjamin Cocom-Chan, Hamed Khakzad, Mahamadou Konate, Daniel Isui Aguilar, Chakir Bello, Cesar Valencia-Gallardo, Yosra Zarrouk, Jacques Fattaccioli, Alain Mauviel, Delphine Javelaud and Guy Tran Van Nhieu. IpaA reveals distinct modes of vinculin activation during Shigella invasion and cell-matrix adhesion. bioRxiv, 2023. URL: https://www.biorxiv.org/content/early/2023/10/09/2023.03.23.533139
33 Sjoerd Jacob De Vries, Isaure Chauvot de Beauchêne, Christina Eva Maria Schindler and Martin Zacharias. Cryo-EM Data Are Superior to Contact and Interface Information in Integrative Modeling. Biophysical Journal, February 2016.
34 S. E. Dobbins, V. I. Lesk and M. J. E. Sternberg. Insights into protein flexibility: The relationship between normal modes and conformational change upon protein-protein docking. Proceedings of the National Academy of Sciences 105(30), 2008, 10390-10395.
35 Richard Evans, Michael O'Neill, Alexander Pritzel, Natasha Antropova, Andrew Senior, Tim Green, Augustin Žídek, Russ Bates, Sam Blackwell, Jason Yim, Olaf Ronneberger, Sebastian Bodenstein, Michal Zielinski, Alex Bridgland, Anna Potapenko, Andrew Cowie, Kathryn Tunyasuvunakool, Rishub Jain, Ellen Clancy, Pushmeet Kohli, John Jumper and Demis Hassabis. Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2021. URL: https://www.biorxiv.org/content/early/2021/10/04/2021.10.04.463034
36 W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus. Knowledge Discovery in Databases: An Overview. AI Magazine 13, 1992, 57-70.
37 R. Fronzes, E. Schäfer, L. Wang, H. R. Saibil, E. V. Orlova and G. Waksman. Structure of a type IV secretion system core complex. Science 323, 2011, 266-268.
38 Anisah Ghoorah, Marie-Dominique Devignes, Malika Smaïl-Tabbone and David Ritchie. KBDOCK 2013: A spatial classification of 3D protein domain family interactions. Nucleic Acids Research 42(D1), January 2014, 389-395.
39 Anisah Ghoorah, Marie-Dominique Devignes, Malika Smaïl-Tabbone and David Ritchie. Protein Docking Using Case-Based Reasoning. Proteins - Structure, Function and Bioinformatics 81(12), October 2013, 2150-2158.
40 Anisah Ghoorah, Marie-Dominique Devignes, Malika Smaïl-Tabbone and David Ritchie. Spatial clustering of protein binding sites for template based protein docking. Bioinformatics 27(20), August 2011, 2820-2827.
41 D. S. Himmelstein, A. Lizee, C. Hessler, L. Brueggeman, S. L. Chen, D. Hadley, A. Green, P. Khankhanian and S. E. Baranzini. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 6, Sep 2017.
42 H. I. Ingólfsson, C. A. Lopez, J. J. Uusitalo, D. H. de Jong, S. M. Gopal, X. Periole and S. R. Marrink. The power of coarse graining in biomolecular simulations. WIRES Comput. Mol. Sci. 4, 2013, 225-248. URL: http://dx.doi.org/10.1002/wcms.1169
43 Kamrul Islam, Sabeur Aridhi and Malika Smaïl-Tabbone. Negative Sampling and Rule Mining for Explainable Link Prediction in Knowledge Graphs. Knowledge-Based Systems 250, August 2022, 109083.
44 Kamrul Islam. Explainable link prediction in large complex graphs - application to drug repurposing. PhD thesis, Université de Lorraine, December 2022.
45 Arman Izadi, Yasaman Karami, Eleni Bratanis, Sebastian Wrighton, Hamed Khakzad, Maria Nyblom, Berit Olofsson, Lotta Happonen, Di Tang, Michael Nilges, Johan Malmström, Wael Bahnan, Oonagh Shannon, Lars Malmström and Pontus Nordenfelt. The increased hinge flexibility of an IgG1-IgG3 hybrid monoclonal enhances Fc-mediated protection against group A streptococci. bioRxiv, 2023. URL: https://www.biorxiv.org/content/early/2023/10/18/2023.10.14.562368
46 J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli and D. Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), August 2021, 583-589.
47 Y. Karami, F. Guyon, S. De Vries and P. Tuffery. DaReUS-Loop: accurate loop modeling using fragments from remote or unrelated proteins. Sci Rep 8(1), Sep 2018, 13673.
48 Y. Karami, E. Laine and A. Carbone. Dissecting protein architecture with communication blocks and communicating segment pairs. BMC Bioinformatics 17(Suppl 2), Jan 2016, 13.
49 Y. Karami, S. Murail, J. Giribaldi, B. Lefranc, F. Defontaine, O. Lesouhaitier, J. Leprince, S. de Vries and P. Tuffery. Exploring a Structural Data Mining Approach to Design Linkers for Head-to-Tail Peptide Cyclization. J Chem Inf Model 63(20), Oct 2023, 6436-6450.
50 H. Laroussi, Y. Aoudache, E. Robert, V. Libante, L. Thiriet, D. Mias-Lucquin, B. Douzi, Y. Roussel, I. Chauvot de Beauchêne, N. Soler and N. Leblond-Bourget. Exploration of DNA processing features unravels novel properties of ICE conjugation in Gram-positive bacteria. Nucleic Acids Res 50(14), Aug 2022, 8127-8142.
51 articleJ.J. Liu, Z.Z. Guo, T.T. Wu, R. S.R. S. Roy, F.F. Quadir, C.C. Chen and J.J. Cheng. Enhancing alphafold-multimer-based protein complex structure prediction with MULTICOM in CASP15.Commun Biol61Nov 2023, 1140back to text
52 articleF.F. MacLean. Knowledge graphs and their applications in drug discovery.Expert Opin Drug Discov169Sep 2021, 1057--1069back to text
53 articleC.C. Maris, C.C. Dominguez and F. H.F. H. Allain. The RNA recognition motif, a plastic RNA-binding platform to regulate post-transcriptional gene expression.FEBS J2729May 2005, 2118--2131back to text
54 articleA.A. May and M.M. Zacharias. Energy minimization in low-frequency normal modes to efficiently allow for global flexibility during systematic protein-protein docking.Proteins702008, 794--809back to text
55 articleD.Dominique Mias-Lucquin and I.Isaure Chauvot de Beauchene. Conformational variability in proteins bound to single-stranded DNA: A new benchmark for new docking perspectives.Proteins: Structure, Function, and Bioinformatics9032022, 625-631URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.26258DOI back to text
56 J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G. A. Salazar, E. L. L. Sonnhammer, S. C. E. Tosatto, L. Paladin, S. Raj, L. J. Richardson, R. D. Finn and A. Bateman. Pfam: The protein families database in 2021. Nucleic Acids Res 49(D1), Jan 2021, D412-D419.
57 I. H. Moal and P. A. Bates. SwarmDock and the Use of Normal Modes in Protein-Protein Docking. International Journal of Molecular Sciences 11(10), 2010, 3623-3648.
58 C. Morris. Towards a structural biology work bench. Acta Crystallographica PD69, 2013, 681-682.
59 Diana Mustard and David Ritchie. Docking essential dynamics eigenstructures. Proteins: Structure, Function, and Genetics 60, 2005, 269-274.
60 D. N. Nicholson and C. S. Greene. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 18, 2020, 1414-1428.
61 G. Preud'homme, K. Duarte, K. Dalleau, C. Lacomblez, E. Bresso, M. Smaïl-Tabbone, M. Couceiro, M. D. Devignes, M. Kobayashi, O. Huttin, J. P. Ferreira, F. Zannad, P. Rossignol and N. Girerd. Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark. Sci Rep 11(1), Feb 2021, 4202.
62 Daniel J Rigden and Xosé M Fernández. The 2021 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Research 49(D1), Dec 2020, D1-D9. URL: https://doi.org/10.1093/nar/gkaa1216
63 David W. Ritchie. Calculating and scoring high quality multiple flexible protein structure alignments. Bioinformatics 32(17), May 2016, 2650-2658. URL: https://doi.org/10.1093/bioinformatics/btw300
64 A. Rivera-Calzada, R. Fronzes, C. G. Savva, V. Chandran, P. W. Lian, T. Laeremans, E. Pardon, J. Steyaert, H. Remaut, G. Waksman and E. V. Orlova. Structure of a bacterial type IV secretion core complex at subnanometre resolution. EMBO Journal 32, 2013, 1195-1204.
65 Maria Elisa Ruiz Echartea, David Ritchie and Isaure Chauvot de Beauchêne. Using Restraints in EROS-Dock Improves Model Quality in Pairwise and Multicomponent Protein Docking. Proteins - Structure, Function and Bioinformatics 88(8), August 2020, 1121-1128.
66 M. G. Saunders and G. A. Voth. Coarse-graining of multiprotein assemblies. Current Opinion in Structural Biology 22, 2012, 144-150.
67 I. Sillitoe, N. Bordin, N. Dawson, V. P. Waman, P. Ashford, H. M. Scholes, C. S. M. Pang, L. Woodridge, C. Rauer, N. Sen, M. Abbasian, S. Le Cornu, S. D. Lam, K. Berka, I. H. Varekova, R. Svobodova, J. Lees and C. A. Orengo. CATH: increased structural coverage of functional space. Nucleic Acids Res 49(D1), Jan 2021, D266-D273.
68 Nicolas Soler, Emilie Robert, Isaure Chauvot de Beauchêne, Philippe Monteiro, Virginie Libante, Bernard Maigret, Johan Staub, David W. Ritchie, Gérard Guédon, Sophie Payot, Marie-Dominique Devignes and Nathalie N. Leblond-Bourget. Characterization of a relaxase belonging to the MOBT family, a widespread family in Firmicutes mediating the transfer of ICEs. Mobile DNA 10(1), December 2019, 1-16.
69 Athénaïs Vaginay, Taha Boukhobza and Malika Smaïl-Tabbone. From quantitative SBML models to Boolean networks. Applied Network Science 7(1), December 2022, 73.
70 Athénaïs Vaginay, Malika Smaïl-Tabbone and Taha Boukhobza. Towards an automatic conversion from SBML core to SBML qual. JOBIM 2019 - Journées Ouvertes Biologie, Informatique et Mathématiques, poster presentation, Nantes, France, July 2019.
71 Vishwesh Venkatraman and David Ritchie. Flexible protein docking refinement using pose-dependent normal mode analysis. Proteins 80(9), June 2012, 2262-2274.
72 A. B. Ward, A. Sali and I. A. Wilson. Integrative Structural Biology. Biochemistry 6122, 2013, 913-915.
