Many of the processes within living organisms can be studied and understood in terms of biochemical interactions between large macromolecules such as DNA, RNA, and proteins. To a first approximation, DNA may be considered to encode the blueprint for life, whereas proteins and RNA make up the three-dimensional (3D) molecular machinery. Many biological processes are governed by complex systems of proteins which interact cooperatively to regulate the chemical composition within a cell or to carry out a wide range of biochemical processes such as photosynthesis, metabolism, and cell signalling. It is becoming increasingly feasible to isolate and characterise some of the individual protein components of such systems, but it remains extremely difficult to achieve detailed models of how these complex systems actually work. Consequently, a new multidisciplinary approach called integrative structural biology has emerged which aims to bring together experimental data from a wide range of sources and resolution scales in order to meet this challenge.
Understanding how biological systems work at the level of 3D molecular structures presents fascinating challenges for biologists and computer scientists alike. Despite being made from a small set of simple chemical building blocks, protein molecules have a remarkable ability to self-assemble into complex molecular machines which carry out very specific biological processes. As such, these molecular machines may be considered as complex systems because their properties are much greater than the sum of the properties of their component parts.
The overall objective of the Capsid team is to develop algorithms and software to help study biological systems and phenomena from a structural point of view. In particular, the team aims to develop algorithms which can help to model the structures of large multi-component biomolecular machines and to develop tools and techniques to represent and mine knowledge of the 3D shapes of proteins and protein-protein interactions. Thus, a unifying theme of the team is to tackle the recurring problem of representing and reasoning about large 3D macromolecular shapes. More specifically, our aim is to develop computational techniques to represent, analyse, and compare the shapes and interactions of protein molecules in order to help better understand how their 3D structures relate to their biological function. In summary, the Capsid team focuses on the following closely related topics in structural bioinformatics:
new approaches for knowledge discovery in structural databases,
integrative multi-component assembly and modeling.
As indicated above, structural biology is largely concerned with determining the 3D atomic structures of proteins and RNA molecules, and then using these structures to study their biological properties and interactions. Each of these activities can be extremely time-consuming. Solving the 3D structure of even a single protein using X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy can often take many months or even years of effort. Even simulating the interaction between two proteins using a detailed atomistic molecular dynamics simulation can consume many thousands of CPU-hours. While most X-ray crystallographers, NMR spectroscopists, and molecular modelers often use conventional sequence and structure alignment tools to help propose initial structural models through the homology principle, they often study only individual structures or interactions at a time. Due to the difficulties outlined above, only relatively few research groups are able to solve the structures of large multi-component systems.
Similarly, most current algorithms for comparing protein structures, and especially those for modeling protein interactions, work only at the pair-wise level. Of course, such calculations may be accelerated considerably by using dynamic programming (DP) or fast Fourier transform (FFT) techniques. However, it remains extremely challenging to scale up these techniques to model multi-component systems. For example, high performance computing (HPC) facilities may be used to accelerate arithmetically intensive shape-matching calculations, but this generally does not help solve the fundamentally combinatorial nature of many multi-component problems. It is therefore necessary to devise heuristic hybrid approaches which can be tailored to exploit various sources of domain knowledge. We therefore set ourselves the following main computational objectives:
classify and mine protein structures and protein-protein interactions,
develop multi-component assembly techniques for integrative structural biology.
The scientific discovery process is very often based on cycles of measurement, classification, and generalisation. It is easy to argue that this is especially true in the biological sciences. The proteins that exist today represent the molecular product of some three billion years of evolution. Therefore, comparing protein sequences and structures is important for understanding their functional and evolutionary relationships. There is now overwhelming evidence that all living organisms and many biological processes share a common ancestry in the tree of life. Historically, much of bioinformatics research has focused on developing mathematical and statistical algorithms to process, analyse, annotate, and compare protein and DNA sequences because such sequences represent the primary form of information in biological systems. However, there is growing evidence that structure-based methods can help to predict networks of protein-protein interactions (PPIs) with greater accuracy than those which do not use structural evidence. Therefore, developing techniques which can mine knowledge of protein structures and their interactions is an important way to enhance our knowledge of biology.
Often, proteins may be divided into modular sub-units called domains, which can be associated with specific biological functions. Thus, a protein domain may be considered as the evolutionary unit of biological structure and function. It is well known that the 3D structures of protein domains are often more evolutionarily conserved than their one-dimensional (1D) amino acid sequences, but comparing 3D structures is much more difficult than comparing 1D sequences. Consequently, until recently, most evolutionary studies of proteins have compared and clustered 1D amino acid and nucleotide sequences rather than 3D molecular structures.
A pre-requisite for the accurate comparison of protein structures is to have a reliable method for quantifying the structural similarity between pairs of proteins. We recently developed a new protein structure alignment program called Kpax which combines an efficient dynamic programming based scoring function with a simple but novel Gaussian representation of protein backbone shape. This means that we can now quantitatively compare 3D protein domains at a throughput similar to that of conventional protein sequence comparison algorithms. We recently compared Kpax with a large number of other structure alignment programs, and we found Kpax to be the fastest and amongst the most accurate in a CATH family recognition test. The latest version of Kpax can calculate multiple flexible alignments, which should help when comparing more distantly related protein folds and fold families.
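The combination of a Gaussian similarity score with dynamic programming can be sketched as follows. This is not the Kpax implementation: the function names, the Gaussian width `sigma`, the linear gap penalty, and the assumption that both coordinate sets are already expressed in a common frame are all illustrative choices.

```python
import math

def gaussian_similarity(xyz_a, xyz_b, sigma=2.0):
    """Score every pair of points with a Gaussian of their squared distance.

    Assumes both coordinate lists are already placed in a common frame.
    """
    return [[math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (4.0 * sigma ** 2))
             for b in xyz_b] for a in xyz_a]

def dp_align(sim, gap=0.1):
    """Global alignment maximising the summed similarity minus gap penalties."""
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], move[i][0] = -gap * i, "up"
    for j in range(1, m + 1):
        score[0][j], move[0][j] = -gap * j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [(score[i - 1][j - 1] + sim[i - 1][j - 1], "diag"),
                       (score[i - 1][j] - gap, "up"),
                       (score[i][j - 1] - gap, "left")]
            # max() keeps the first best option, so matches win ties.
            score[i][j], move[i][j] = max(options, key=lambda t: t[0])
    # Trace back from the corner to recover the aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

Because the score matrix is filled once and each cell looks at only three neighbours, the cost grows with the product of the two chain lengths, which is the same scaling as conventional sequence alignment.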
Concerning protein structure classification, we aim to explore novel classification paradigms to circumvent the problems encountered with existing hierarchical classifications of protein folds and domains. In particular it will be interesting to set up fuzzy clustering methods taking advantage of our previous work on gene functional classification, but instead using Kpax domain-domain similarity matrices. A non-trivial issue with fuzzy clustering is how to handle similarity rather than mathematical distance matrices, and how to find the optimal number of clusters, especially when using a non-Euclidean similarity measure. We will adapt the algorithms and the calculation of quality indices to the Kpax similarity measure. More fundamentally, it will be necessary to integrate this classification step in the more general process leading from data to knowledge called Knowledge Discovery in Databases (KDD).
Another example where domain knowledge can be useful is during result interpretation: several sources of knowledge have to be used to explicitly characterise each cluster and to help decide its validity. Thus, it will be useful to be able to express data models, patterns, and rules in a common formalism using a defined vocabulary for concepts and relationships. Existing approaches such as the Molecular Interaction (MI) format developed by the Human Genome Organization (HUGO) mostly address the experimental wet lab aspects leading to data production and curation . A different point of view is represented in the Interaction Network Ontology (INO), a community-driven ontology that aims to standardise and integrate data on interaction networks and to support computer-assisted reasoning . However, this ontology does not integrate basic 3D concepts and structural relationships. Therefore, extending such formalisms and symbolic relationships will be beneficial, if not essential, when classifying the 3D shapes of proteins at the domain family level.
A widely used collection of protein domain families is “Pfam”, constructed from multiple alignments of protein sequences. Integrating domain-domain similarity measures with knowledge about domain binding sites, as introduced by us in our KBDOCK approach, can help in selecting interesting subsets of domain pairs before clustering. Thanks to our KBDOCK and Kpax projects, we already have a rich set of tools with which we can start to process and compare all known protein structures and PPIs according to their component Pfam domains.
Linking this new classification to the latest “SIFTS”
(Structure Integration with Function, Taxonomy and Sequence)
functional annotations between standard UniProt (http://
Knowledge of the functional properties of proteins can shed considerable light on how they might interact. However, huge numbers of protein sequences in public databases lack any functional annotation, and the annotation of sequences in such databases is a highly challenging problem. We are developing graph-based and machine learning techniques to annotate automatically the available unannotated sequences in such databases with functional properties such as EC numbers and Gene Ontology (GO) terms. Even if the 3D structures of proteins are unknown, it is natural to suppose that their sequences may be related to each other by the domains, domain families, and super-families that they share. In the frame of the PhD project of Bishnu Sarker, we recently developed a novel graph-based approach called GrAPFI for the automatic functional annotation of protein sequences based on these principles in order to transfer annotations from expert-reviewed sequences to unreviewed sequences in the UniProtKB databases.
At the molecular level, each PPI is embodied by a physical 3D protein-protein interface. Therefore, if the 3D structures of a pair of interacting proteins are known, it should in principle be possible for a docking algorithm to use this knowledge to predict the structure of the complex. However, modeling protein flexibility accurately during docking is very computationally expensive due to the very large number of internal degrees of freedom in each protein, associated with twisting motions around covalent bonds. Therefore, it is highly impractical to use detailed force-field or geometric representations in a brute-force docking search. Instead, most protein docking algorithms use fast heuristic methods to perform an initial rigid-body search in order to locate a relatively small number of candidate binding orientations, and these are then refined using a more expensive interaction potential or force-field model, which might also include flexible refinement using molecular dynamics (MD), for example.
In our Hex protein docking program , the shape of a protein molecule is represented using polar Fourier series expansions of the form
where
This equation uses the notion of overlap between 3D scalar quantities to give a physics-based scoring function. If the aim is to find the configuration that gives the most favourable interaction energy, then it is necessary to perform a six-dimensional search in the space of available rotational and translational degrees of freedom. By re-writing the polar Fourier expansions using complex spherical harmonics, we showed previously that fast Fourier transform (FFT) techniques may be used to accelerate the search in up to five of the six degrees of freedom. Furthermore, we also showed that such calculations may be accelerated dramatically on modern graphics processor units. Consequently, we are continuing to explore new ways to exploit the polar Fourier approach.
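The way an FFT evaluates an overlap score for every value of one degree of freedom at once can be illustrated with a deliberately reduced example: two periodic functions sampled around a circle, one of which is a rotated copy of the other. This is only a one-dimensional sketch of the general principle, not the spherical polar Fourier machinery used in Hex, and the function name `best_rotation` is hypothetical.

```python
import numpy as np

def best_rotation(f, g):
    """Score every cyclic shift of g against f with a single FFT correlation.

    scores[k] equals sum_i f[i] * g[(i - k) % n], so the returned shift is
    the rotation of g (in sample steps) that overlaps best with f.
    """
    spectrum = np.fft.fft(f) * np.conj(np.fft.fft(g))
    scores = np.fft.ifft(spectrum).real
    return int(np.argmax(scores)), scores

# Toy usage: g is a random angular profile, f is g rotated by 5 samples.
rng = np.random.default_rng(0)
g = rng.normal(size=64)
f = np.roll(g, 5)
shift, scores = best_rotation(f, g)
```

A direct evaluation would cost n operations for each of the n shifts, whereas the FFT route obtains all n scores in a number of operations proportional to n log n.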
Although protein-protein docking algorithms are improving, it still remains challenging to produce a high resolution 3D model of a protein complex using ab initio techniques, mainly due to the problem of structural flexibility described above. However, with the aid of even just one simple constraint on the docking search space, the quality of docking predictions can improve considerably. In particular, many protein complexes involve symmetric arrangements of one or more sub-units, and the presence of symmetry may be exploited to reduce the search space considerably.
For example,
using our operator notation
(in which
where the identical monomers A and B are initially placed at the origin,
and
Many approaches have been proposed in the literature to take into account protein flexibility during docking. The most thorough methods rely on expensive atomistic simulations using MD. However, much of an MD trajectory is unlikely to be relevant to a docking encounter unless it is constrained to explore a putative protein-protein interface. Consequently, MD is normally only used to refine a small number of candidate rigid body docking poses. A much faster, but more approximate method is to use coarse-grained (CG) normal mode analysis (NMA) techniques to reduce the number of flexible degrees of freedom to just one or a handful of the most significant vibrational modes. In our experience, docking ensembles of NMA conformations does not give much improvement over basic FFT-based soft docking, and it is very computationally expensive to use side-chain repacking to refine candidate soft docking poses.
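To make the idea of reducing flexibility to a few vibrational modes concrete, the following sketch builds a simple elastic network (an anisotropic network model) over one point per residue and diagonalises its Hessian. The choice of this particular model, the cutoff, the uniform spring constant `gamma`, and the function names are assumptions made for illustration; the text does not specify which NMA variant was used.

```python
import numpy as np

def anm_modes(xyz, cutoff=12.0, gamma=1.0):
    """Eigen-decompose the elastic-network Hessian of a set of 3D points.

    Every pair of points closer than `cutoff` is joined by a spring of
    strength `gamma`. Returns eigenvalues (ascending) and eigenvectors.
    """
    n = len(xyz)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = xyz[j] - xyz[i]
            r2 = float(d @ d)
            if r2 > cutoff ** 2:
                continue
            block = -gamma * np.outer(d, d) / r2
            hessian[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hessian[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            hessian[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hessian[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    return np.linalg.eigh(hessian)

def displace(xyz, mode, amplitude):
    """Move the structure along one normal mode by the given amplitude."""
    return xyz + amplitude * mode.reshape(-1, 3)
```

For a connected, non-degenerate structure the first six eigenvalues are numerically zero (rigid-body translations and rotations), so the softest internal motion is the seventh eigenvector, for example `displace(xyz, evecs[:, 6], 0.5)`.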
In the last few years, CG force-field models have become increasingly popular in the MD community because they allow very large biomolecular systems to be simulated using conventional MD programs. Typically, a CG force-field representation replaces the atoms in each amino acid with two to four “pseudo-atoms”, and it assigns each pseudo-atom a small number of parameters to represent its chemo-physical properties. By directly attacking the quadratic nature of pair-wise energy functions, coarse-graining can speed up MD simulations by up to three orders of magnitude. Despite this simplification, such CG models can still produce useful models of very large multi-component assemblies. Furthermore, this kind of coarse-graining effectively integrates out many of the internal degrees of freedom (DOFs) to leave a smoother but still physically realistic energy surface. We are therefore developing a “coarse-grained” scoring function for fast protein-protein docking and multi-component assembly in the frame of the PhD project of Maria-Elisa Ruiz-Echartea.
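The quadratic effect of coarse-graining on pair-wise energy evaluations can be checked with a small calculation. The figure of 16 atoms per residue and the choice of 3 pseudo-atoms are illustrative assumptions, and the sketch counts particle pairs only; any further gain, for example from longer time steps, is not modelled here.

```python
def pair_count(n_particles):
    """Number of distinct particle pairs in an all-against-all energy sum."""
    return n_particles * (n_particles - 1) // 2

def cg_pair_reduction(n_residues, atoms_per_residue=16, beads_per_residue=3):
    """Ratio of all-atom pairs to coarse-grained pairs for one protein."""
    all_atom = pair_count(n_residues * atoms_per_residue)
    coarse = pair_count(n_residues * beads_per_residue)
    return all_atom / coarse

# For a 300-residue protein the ratio is close to (16 / 3) ** 2.
ratio = cg_pair_reduction(300)
```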
We also want to develop related approaches for integrative structure modeling using cryo-electron microscopy (cryo-EM). Thanks to recent developments in cryo-EM instruments and technologies, it is now feasible to capture low resolution images of very large macromolecular machines. However, while such developments offer the intriguing prospect of being able to trap biological systems in unprecedented levels of detail, there will also come an increasing need to analyse, annotate, and interpret the enormous volumes of data that will soon flow from the latest instruments. In particular, a new challenge that is emerging is how to fit previously solved high resolution protein structures into low resolution cryo-EM density maps. However, the problem here is that large molecular machines will have multiple sub-components, some of which will be unknown, and many of which will fit each part of the map almost equally well. Thus, the general problem of building high resolution 3D models from cryo-EM data is like building a complex 3D jigsaw puzzle in which several pieces may be unknown or missing, and none of which will fit perfectly. We wish to proceed firstly by putting more emphasis on the single-body terms in the scoring function, and secondly by using fast CG representations and knowledge-based distance restraints to prune large regions of the search space (thesis project of Maria Elisa Ruiz Echartea).
The projects in this domain are carried out in collaboration with the Orpailleur Team.
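The translational part of fitting a component into a density map can be sketched as a 3D cross-correlation evaluated by FFT. The example below assumes that the component has already been converted to a density grid with the same voxel size as the map, treats the map as periodic, and ignores rotations; the function name `fit_translation` is hypothetical.

```python
import numpy as np

def fit_translation(density_map, probe):
    """Find the grid shift of `probe` that best overlaps `density_map`.

    The probe is zero-padded to the map size and correlated with the map
    for all translations at once using 3D FFTs.
    """
    padded = np.zeros_like(density_map)
    pz, py, px = probe.shape
    padded[:pz, :py, :px] = probe
    spectrum = np.fft.fftn(density_map) * np.conj(np.fft.fftn(padded))
    corr = np.fft.ifftn(spectrum).real
    shift = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(int(s) for s in shift), corr

# Toy usage: hide a small random density block inside an empty map.
rng = np.random.default_rng(2)
probe = rng.uniform(0.1, 1.0, size=(4, 4, 4))
density_map = np.zeros((16, 16, 16))
density_map[5:9, 7:11, 2:6] = probe
shift, corr = fit_translation(density_map, probe)
```

In a real map several components may give near-equal correlation peaks, which is why the text argues for additional restraints rather than relying on the overlap score alone.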
Huge and ever increasing amounts of biomedical data (“Big Data”) are bringing new challenges and novel opportunities for knowledge discovery in biomedicine. We are actively collaborating with biologists and clinicians to design and implement approaches for selecting, integrating, and mining biomedical data in various areas. In particular, we are focusing on leveraging bio-ontologies at all steps of this process (the main thesis topic of Gabin Personeni, co-supervised by Marie-Dominique Devignes and Adrien Coulet from the Orpailleur team). One specific application concerns exploiting Linked Open Data (LOD) to characterise the genes responsible for intellectual deficiency. This work is in collaboration with Prof. P. Jonveaux of the Laboratoire de Génétique Humaine at CHRU Nancy. This involves using inductive logic programming as a machine learning method and at least three different ontologies (Gene Ontology, Human Phenotype Ontology, and Disease Ontology). This approach has also been applied using pattern structure mining (an extension of formal concept analysis) of drug and disease ontologies to discover frequently associated adverse drug events in patients. This work was performed in collaboration with the Centre for BioMedical Informatics Research (BMIR) at Stanford University.
Recently, a new application for biomedical knowledge discovery has emerged from the ANR “FIGHT-HF” (fight heart failure) project, which is in collaboration with several INSERM teams at CHRU Nancy. In this case, the molecular mechanisms that underlie HF at the cellular and tissue levels will be considered against a background of all available data and ontologies, and represented in a single integrated complex network. A network platform is under construction with the help of a young start-up company called Edgeleap. Together with this company, we are developing query and analysis facilities to help biologists and clinicians to identify relevant biomarkers for patient phenotyping. Docking of small molecules on candidate receptors, as well as protein-protein docking, will also be used to clarify a certain number of relations in the complex HF network.
Prokaryotic type IV secretion systems constitute a fascinating example of a family of nanomachines capable of translocating DNA and protein molecules through the cell membrane from one cell to another. The complete system involves at least 12 proteins. The structure of the core channel involving three of these proteins has recently been determined by cryo-EM experiments. However, the detailed nature of the interactions between the remaining components and those of the core channel remains to be resolved. Therefore, these secretion systems represent another family of complex biological systems (scales 2 and 3) that call for integrated modeling approaches to fully understand their machinery.
In the frame of the Lorraine Université d'Excellence (LUE-FEDER) “CITRAM” project, Marie-Dominique Devignes is pursuing her collaboration with Nathalie Leblond of the Genome Dynamics and Microbial Adaptation (DynAMic) laboratory (UMR 1128, Université de Lorraine, INRA) on the discovery of new integrative conjugative elements (ICEs) and integrative mobilisable elements (IMEs) in prokaryotic genomes. These elements use Type IV secretion systems for transferring DNA horizontally from one cell to another. We have discovered more than 200 new ICEs/IMEs by systematic exploration of 72 Streptococcus genomes. As these elements encode all or a subset of the components of the Type IV secretion system, they constitute a valuable source of sequence data and constraints for modeling these systems in 3D. Another interesting aspect of this particular system is that, unlike other secretion systems, the Type IV secretion systems are not restricted to a particular group of bacteria.
As well as playing an essential role in the translation of DNA into proteins, RNA molecules carry out many other essential biological functions in cells, often through their interactions with proteins. A critical challenge in modelling such interactions computationally is that the RNA is often highly flexible, especially in single-stranded (ssRNA) regions of its structure. These flexible regions are often very important because it is through their flexibility that the RNA can adjust its 3D conformation in order to bind to a protein surface. However, conventional protein-protein docking algorithms generally assume that the 3D structures to be docked are rigid, and so are not suitable for modeling protein-RNA interactions. There is therefore much interest in developing protein-RNA docking algorithms which can take RNA flexibility into account.
We are currently developing a novel flexible docking algorithm which first docks small fragments of ssRNA (typically three nucleotides at a time) onto a protein surface, and then combinatorially reassembles those fragments in order to recover a contiguous ssRNA structure on the protein surface. We have since implemented a prototype “forward-backward” dynamic programming algorithm with stochastic backtracking that allows us to model protein-RNA interactions for ssRNAs of up to 7 nucleotides without requiring any prior knowledge of the interaction, while still avoiding a brute-force search. In the frame of our PEPS collaboration “InterANRIL” with the IMoPA lab (University of Lorraine), we are currently working with biologists to apply the approach to modeling certain long non-coding RNA (lncRNA) complexes. In order to extend this approach to partially structured RNA molecules, we have built an automated pipeline to create (i) libraries of RNA fragments with arbitrary characteristics such as secondary structure, and (ii) testing benchmarks for applying those libraries in docking. In the frame of our LUE-FEDER CITRAM project, we adapted this approach and this pipeline to DNA docking in order to model the complex formed by a bacterial relaxase and its target DNA.
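The reassembly step can be pictured as a chain-building problem over layers of docked fragment poses. The sketch below is a simplified illustration rather than the prototype described above: it assumes that each pose has a single energy, that compatibility between consecutive poses is given as a boolean table, and that chains are sampled with Boltzmann weights. All names and the `kT` parameter are hypothetical.

```python
import math
import random

def forward(energies, compatible, kT=1.0):
    """Forward pass over fragment layers.

    energies[l][a] is the energy of pose a of fragment l, and
    compatible[l][a][b] says whether pose a of fragment l can be chained
    to pose b of fragment l + 1. z[l][b] accumulates the total Boltzmann
    weight of all partial chains that end in pose b.
    """
    z = [[math.exp(-e / kT) for e in energies[0]]]
    for l in range(1, len(energies)):
        prev = z[-1]
        layer = []
        for b, e in enumerate(energies[l]):
            reach = sum(prev[a] for a in range(len(prev)) if compatible[l - 1][a][b])
            layer.append(math.exp(-e / kT) * reach)
        z.append(layer)
    return z

def sample_chain(z, compatible, rng):
    """Stochastic backtracking: draw one full chain in proportion to its weight."""
    last = len(z) - 1
    chain = [rng.choices(range(len(z[last])), weights=z[last])[0]]
    for l in range(last - 1, -1, -1):
        nxt = chain[-1]
        weights = [z[l][a] if compatible[l][a][nxt] else 0.0
                   for a in range(len(z[l]))]
        chain.append(rng.choices(range(len(z[l])), weights=weights)[0])
    chain.reverse()
    return chain
```

The forward pass visits each compatible pose pair once, so the cost grows linearly with the number of fragments instead of exponentially, which is what allows a brute-force enumeration of chains to be avoided.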
Isaure Chauvot de Beauchêne has obtained H2020 funding for two international PhD students under the MSCA-ITN programme. The project will study protein/RNA interactions, and will start on 01/01/2019.
Keywords: 3D rendering - Bioinformatics - 3D interaction - Structural Biology
Scientific Description: Hex is an interactive protein docking and molecular superposition program for Linux, Mac-OS, and Windows-XP. Hex understands protein and DNA structures in PDB format, and it can also read small-molecule SDF files. Recent versions include CUDA support for Nvidia GPUs. On a modern workstation, docking times range from a few minutes or less when the search is constrained to known binding sites, to about half an hour for a blind global search (or just a few seconds with CUDA).
Functional Description: The underlying algorithm uses a novel polar Fourier correlation technique to accelerate the search for close-fitting orientations of the two molecules.
Participant: David Ritchie
Contact: David Ritchie
URL: http://
Keyword: 3D interaction
Scientific Description: Kbdock is a database of 3D protein domain-domain interactions with a web interface.
Functional Description: The Kbdock database is built from a snapshot of the Protein Databank (PDB) in which all 3D structures are cut into domains according to the Pfam domain description. A web interface allows 3D domain-domain interactions to be compared by Pfam family.
Authors: Anisah Ghoorah, David Ritchie and Marie-Dominique Devignes
Contact: David Ritchie
Keywords: Bioinformatics - Structural Biology
Scientific Description: Kpax is a program for aligning and superposing the 3D structures of protein molecules.
Functional Description: The algorithm uses a Gaussian representation of the protein backbone in order to construct a similarity score based on the 3D overlap of the Gaussians of the proteins to be superposed. Multiple proteins may be aligned together (multiple structural alignment) and databases of protein structures may be searched rapidly.
Participant: David Ritchie
Contact: David Ritchie
Protein Symmetry Assembler
Keywords: Proteins - Structural Biology
Scientific Description: Sam is a program for making symmetrical protein complexes, starting from a single monomer.
Functional Description: The algorithm searches for good docking solutions between protein monomers using a spherical polar Fast Fourier transform correlation in which symmetry restraints are built into the calculation. Thus every candidate solution is guaranteed to have the desired symmetry.
Authors: David Ritchie and Sergey Grudinin
Partner: CNRS
Contact: David Ritchie
URL: http://
Keywords: 3D reconstruction - Cryo-electron microscopy - Fitting
Scientific Description: A program for fitting high resolution 3D protein structures into low resolution cryo-EM density maps.
Functional Description: A highly parallel fast Fourier transform (FFT) EM density fitting program which can exploit the special hardware properties of modern graphics processor units (GPUs) to accelerate both the translational and rotational parts of the correlation search.
Authors: Van-Thai Hoang and David Ritchie
Contact: David Ritchie
ECDomainMiner
Keyword: Functional annotation
Scientific Description: EC-DomainMiner uses a recommender-based approach for associating EC (Enzyme Commission) numbers with protein Pfam domains from EC-sequence relationships that have been annotated previously in the SIFTS and Uniprot databases.
Functional Description: A program to associate protein Enzyme Commission numbers with Pfam domains
Contact: David Ritchie
URL: http://
GO-DomainMiner
Keyword: Functional annotation
Functional Description: GO-DomainMiner is a graph-based approach for associating GO (gene ontology) terms with protein Pfam domains.
Contact: David Ritchie
URL: http://
A Block-centric graph processing framework for LArge Dynamic Graphs
Keywords: Distributed computing - Dynamic graph processing
Functional Description: BLADYG is a block-centric framework that addresses the issue of dynamism in large-scale graphs. BLADYG starts its computation by collecting the graph data from various data sources. After collecting the graph data, BLADYG partitions the input graph into multiple partitions. Each BLADYG worker loads its block/partition and performs both local and remote computations, after which the status of the blocks is updated. The BLADYG coordinator orchestrates the execution of the considered graph operation in order to deal with graph updates.
Partner: University of Trento
Contact: Sabeur Aridhi
Clebsch-Gordan Coefficients
Keywords: Clebsch-Gordan coupling coefficient - 3j symbol
Functional Description: Clebsch-Gordan coupling coefficients appear in many areas of physics and chemistry. CGC is a small library of functions and a demo driver program for calculating Clebsch-Gordan coupling coefficients up to very high principal quantum numbers.
Contact: David Ritchie
URL: http://
GrAPFI: Graph-based Automatic Protein Function Inference
Keyword: Proteins
Functional Description: GrAPFI is a Graph-based Automatic Protein Function Inference tool that aims to annotate protein sequences with EC numbers. The underlying philosophy of GrAPFI assumes that proteins can be linked through the domains, families, and superfamilies that they share. Several domain databases exist, such as Pfam, SMART, CDD, Gene3D, and Prosite. Furthermore, InterPro aims to integrate information from all such databases by assigning them unique InterPro signatures. GrAPFI uses InterPro signatures to link proteins, since these signatures include information from several major family and domain databases. Our computational analysis and cross-validation show that GrAPFI achieves state-of-the-art performance in EC number prediction.
Contact: Sabeur Aridhi
The MBI (Modeling Biomolecular Interactions) platform (http://
Contact: Marie-Dominique Devignes
Identifying new molecular targets using comparative genomics and knowledge of disease mechanisms is a rational first step in the search for new preventative or therapeutic drug treatments. We are mostly concerned with three global health problems, namely fungal and bacterial infections and hypertension. Through on-going collaborations with several Brazilian laboratories (at University of Mato Grosso State, University of Maringá, Embrapa, and University of Brasilia), we previously identified several novel small-molecule drug leads against Trypanosoma cruzi, a parasite responsible for Chagas disease. With the University of Maringá, we subsequently found several active molecules against the flavoenzyme TRR1 in Candida albicans, and two manuscripts are in preparation. We also proposed several small-molecule inhibitors against Fusarium graminearum, a fungal threat to global wheat production. Two further manuscripts on this topic are currently in preparation. Concerning hypertension, we continued our collaboration with Prof. Catherine Llorens-Cortes at Collège de France to study the interaction between the apelin receptor (a transmembrane protein important for blood pressure regulation) and the aminopeptidase A enzyme.
It is well known that many therapeutic drug molecules can have adverse side effects. However, when patients take several combinations of drugs it can be difficult to determine which drug is responsible for which side effect. In collaboration with Adrien Coulet (Orpailleur team co-supervisor of Gabin Personeni) and Prof. Michel Dumontier (Biomedical Informatics Research Laboratory, Stanford), we developed an approach which combines multiple ontologies such as the Anatomical Therapeutic Chemical (ATC) classification of drugs, the ICD-9 classification of diseases, and the SNOMED-CT medical vocabulary together with the use of Pattern Structures (an extension of Formal Concept Analysis) in order to extract association rules to analyse the co-occurrence of adverse drug effects in patient records. A paper describing this work has been published in the Journal of Biomedical Semantics.
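The notion of an association rule between co-occurring adverse events can be made concrete with the two standard rule metrics, support and confidence. The sketch below works on plain sets of event names and does not use ontologies or Pattern Structures; the function name and the toy event labels are hypothetical.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    `records` is a list of sets, one set of observed events per patient.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Toy patient records (illustrative labels only).
records = [
    {"nausea", "headache"},
    {"nausea", "headache", "rash"},
    {"nausea"},
    {"rash"},
]
support, confidence = rule_metrics(records, {"nausea"}, {"headache"})
```

Here the rule holds in two of the four records (support 0.5) and in two of the three records that contain the antecedent (confidence 2/3).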
Many proteins form symmetrical complexes in which each structure contains two or more identical copies of the same sub-unit. We recently developed a novel polar Fourier docking algorithm called “Sam” for automatically assembling symmetrical protein complexes. A journal article describing the Sam algorithm has been published. An article describing the results obtained when using Sam to dock several symmetrical protein complexes from the “CASP/CAPRI” docking experiment has also been published. This study showed that many of the models of protein structures built by members of the “CASP” fold prediction community are “dockable” in the sense that Sam is able to find acceptable docking solutions from amongst the CASP models.
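Why symmetry shrinks the search can be seen from a small geometric sketch: once the pose of a single monomer relative to the symmetry axis is fixed, every other copy in a cyclic complex follows by rotation. The code below is only an illustration of that idea for cyclic (C_n) symmetry about the z axis; it is not the Sam algorithm, and the function name and the single `radius` parameter are assumptions.

```python
import math

def cn_complex(monomer, n, radius):
    """Build a C_n ring from one monomer given as a list of (x, y, z) points.

    The monomer is shifted along x by `radius` and then copied n times by
    rotating about the z axis in steps of 360/n degrees.
    """
    copies = []
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        c, s = math.cos(angle), math.sin(angle)
        copy = []
        for x, y, z in monomer:
            x += radius  # move away from the symmetry axis first
            copy.append((c * x - s * y, s * x + c * y, z))
        copies.append(copy)
    return copies

# Toy usage: a two-point "monomer" assembled into a trimer.
trimer = cn_complex([(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)], n=3, radius=5.0)
```

Every complex generated this way has the requested symmetry by construction, so a search only needs to vary the monomer pose rather than the poses of all sub-units independently.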
More recently, we have been working to extend the polar Fourier correlation algorithm to use very high angular resolution spherical Bessel basis functions. As part of this work, we have developed a very fast recursive algorithm for calculating high order Clebsch-Gordan coupling coefficients. A manuscript describing this work has been submitted to a quantum mechanics journal.
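For reference, a Clebsch-Gordan coefficient can also be evaluated directly from the standard closed-form (Racah) sum, as in the sketch below. This is not the recursive algorithm mentioned above: it is restricted to integer quantum numbers, uses exact rational arithmetic for the sum, and is only practical for modest orders. The function name is an assumption.

```python
from fractions import Fraction
from math import factorial, sqrt

def clebsch_gordan(j1, m1, j2, m2, j, m):
    """Return <j1 m1 j2 m2 | j m> for integer quantum numbers."""
    # Selection rules: projections add up and j obeys the triangle rule.
    if m1 + m2 != m or not abs(j1 - j2) <= j <= j1 + j2:
        return 0.0
    if abs(m1) > j1 or abs(m2) > j2 or abs(m) > j:
        return 0.0
    f = factorial
    norm = Fraction((2 * j + 1) * f(j + j1 - j2) * f(j - j1 + j2) * f(j1 + j2 - j),
                    f(j1 + j2 + j + 1))
    norm *= f(j + m) * f(j - m) * f(j1 - m1) * f(j1 + m1) * f(j2 - m2) * f(j2 + m2)
    # Only values of k that keep every factorial argument non-negative contribute.
    k_min = max(0, j2 - j - m1, j1 + m2 - j)
    k_max = min(j1 + j2 - j, j1 - m1, j2 + m2)
    total = Fraction(0)
    for k in range(k_min, k_max + 1):
        denom = (f(k) * f(j1 + j2 - j - k) * f(j1 - m1 - k) * f(j2 + m2 - k)
                 * f(j - j2 + m1 + k) * f(j - j1 - m2 + k))
        total += Fraction((-1) ** k, denom)
    return float(total) * sqrt(norm)
```

The alternating sum of large factorial ratios is the reason exact fractions are used here; in floating point the same sum loses accuracy quickly as the quantum numbers grow, which is one motivation for recursive schemes.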
Comparing two or more proteins by optimally aligning and superposing their backbone structures provides a way to detect evolutionary relationships between proteins that cannot be detected by comparing only their primary amino-acid sequences. The latest version of our “Kpax” protein structure alignment algorithm can flexibly align pairs of structures that cannot be completely superposed by a single rigid-body transformation, and can flexibly calculate multiple alignments of several similar structures. In collaboration with Alain Hein of the INRA lab “Agronomie et Environnement”, we used Kpax to help study the structures of various “Cyp450” enzymes in plants. In collaboration with Emmanuel Levy of the Weizmann Institute, we used Kpax to superpose and compare all of the symmetrical protein complexes in the Protein Data Bank in order to verify or remediate their quaternary structure annotations. A manuscript describing this work has been published in Nature Methods.
Many protein chains in the Protein Data Bank (PDB) are cross-referenced with Pfam domains and Gene Ontology (GO) terms. However, these annotations do not explicitly indicate any relation between EC numbers and Pfam domains, and many chains lack GO annotations. In order to address this limitation, as part of the PhD thesis project of Seyed Alborzi, we developed the CODAC approach for mining multiple protein data sources (i.e. SwissProt, TrEMBL, and SIFTS) in order to associate GO molecular function terms with Pfam domains, for example. We named the software implementation “GO-DomainMiner”. This work was first presented at IWBBIO 2017. A full paper has recently been accepted for a special issue of BMC Bioinformatics.
In collaboration with Maria Martin's team at the European Bioinformatics Institute (EBI), we combined the CODAC approach with a novel combinatorial association rule based approach called “CARDM” for annotating protein sequences. When applied to the large UniProt/TrEMBL sequence database of 63 million protein entries, CARDM predicted over 24 million Enzyme Commission (EC) numbers and 188 million GO terms for those entries. A journal paper in collaboration with the EBI on comparing the quality of these predicted annotations with other state-of-the-art annotation methods is in preparation, and a poster was presented at ISMB-ECCB-2017. As part of the PhD thesis of Bishnu Sarker, we also developed GrAPFI, a graph-based protein function annotation approach. GrAPFI applies a label propagation algorithm to a complex network representation of protein sequence data. A full paper on this work has recently been accepted by the International Conference on Complex Networks and their Applications.
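To make the idea of label propagation on a protein network concrete, here is a minimal generic sketch: annotated proteins keep their labels, and each unannotated protein repeatedly takes the label with the largest total edge weight among its labelled neighbours. This is not the GrAPFI implementation; the function names, the edge weights (which might stand for some sequence or domain similarity), and the toy labels are assumptions made for illustration only.

```python
from collections import defaultdict

def build_graph(weighted_edges):
    """Turn (u, v, weight) triples into a symmetric adjacency dict."""
    graph = defaultdict(dict)
    for u, v, w in weighted_edges:
        graph[u][v] = w
        graph[v][u] = w
    return graph

def propagate_labels(graph, seed_labels, max_iter=20):
    """Weighted-majority label propagation with synchronous updates."""
    labels = dict(seed_labels)
    for _ in range(max_iter):
        updates = {}
        for node in sorted(graph):
            if node in seed_labels:
                continue  # annotated proteins keep their label
            votes = defaultdict(float)
            for neighbour, weight in graph[node].items():
                if neighbour in labels:
                    votes[labels[neighbour]] += weight
            if votes:
                # Highest total weight wins; ties are broken alphabetically.
                best = min(votes, key=lambda lab: (-votes[lab], lab))
                if labels.get(node) != best:
                    updates[node] = best
        if not updates:
            break  # converged: no label changed in this round
        labels.update(updates)
    return labels

# Hypothetical toy network: P1 and P2 are annotated, P3 to P5 are not.
graph = build_graph([
    ("P1", "P3", 0.9),
    ("P2", "P4", 0.8),
    ("P3", "P4", 0.3),
    ("P3", "P5", 0.5),
    ("P4", "P5", 0.6),
])
annotated = propagate_labels(graph, {"P1": "kinase", "P2": "protease"})
print(annotated)
```

The synchronous update (all new labels are applied at the end of each round) keeps the result independent of the order in which nodes are visited, which is one simple design choice among several.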
The huge number of protein sequences in protein databases such as UniProtKB calls for rapid procedures to annotate them automatically. We are using existing protein annotations to predict the annotations of new or non-reviewed proteins. In this context, we developed the “DistNBLP” method for annotating protein sequences using a graph representation and a distributed label propagation algorithm. DistNBLP uses the BLADYG framework to process protein graphs on multiple compute nodes by applying a neighbourhood-based label propagation algorithm in a distributed way. We applied DistNBLP in the recent “CAFA 3” (Critical Assessment of Protein Function Annotation) community experiment to annotate new protein sequences automatically. This work was presented as a poster at ISMB/ECCB-2017. We are also interested in feature selection for subgraph patterns. In collaboration with the LIMOS laboratory at Université Clermont Auvergne, we also developed a scalable approach using MapReduce for identifying sub-graphs having similar labels in very large graphs.
Modeling how flexible polymers bind to proteins presents enormous computational challenges due to the large conformational search space that arises from the many internal rotational degrees of freedom in polymer structures. In collaboration with Sergey Samsonov (Gdansk University, Poland), we extended our fragment-based flexible docking approach to model how flexible glycosaminoglycans (GAGs) might bind to the surface of a known protein structure. A paper has been submitted to the Journal of Computational Chemistry.
In collaboration with Sjoerd de Vries (Univ Paris Diderot), we have created a new protein-glycan interaction force-field and integrated it in the ATTRACT docking engine. We also participated in a comparative study of the main current protein-GAG docking methods.
We have designed a method to compute similarities on unlabeled data using stochastic decision trees. The main idea of Unsupervised Extremely Randomized Trees (UET) is to randomly and iteratively split the data until a stopping criterion is met. Pairwise similarity values are computed based on the co-occurrence of samples in the leaves of each generated tree. We evaluated our method on synthetic and real-world datasets by comparing the mean similarities between samples with the same label and the mean similarities between samples with distinct labels. Empirical studies show that the method effectively gives distinct similarity values between samples belonging to distinct clusters, and gives indiscernible values when there is no cluster structure. We also assessed some interesting properties such as invariance under monotone transformations of variables and robustness to correlated variables and noise. Our experiments show that the algorithm outperforms existing methods in some cases, and can reduce the amount of preprocessing needed with many real-world datasets. We plan to study the application of this “global” pairwise similarity computation to quantify protein structural similarities. Two interesting problems will concern the representation of the protein structure and how to tackle extra constraints such as invariance under rotational and translational transformations.
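The leaf co-occurrence idea can be sketched in a few lines. The code below is a simplified illustration written for this report, not the published UET implementation: each tree picks a random feature and a uniformly random cut point, stops splitting when a node holds fewer than `min_split` samples, and the similarity of two samples is the fraction of trees in which they share a leaf. The names `uet_similarity` and `min_split`, the default parameters, and the toy data are assumptions.

```python
import random

def _grow(indices, data, min_split, rng, leaves):
    """Recursively split a node at random and collect the resulting leaves."""
    if len(indices) < min_split:
        leaves.append(indices)
        return
    feature = rng.randrange(len(data[0]))
    values = [data[i][feature] for i in indices]
    cut = rng.uniform(min(values), max(values))
    left = [i for i in indices if data[i][feature] < cut]
    right = [i for i in indices if data[i][feature] >= cut]
    if not left or not right:
        # Degenerate cut (e.g. constant feature): keep the node as a leaf.
        leaves.append(indices)
        return
    _grow(left, data, min_split, rng, leaves)
    _grow(right, data, min_split, rng, leaves)

def uet_similarity(data, n_trees=200, min_split=4, seed=0):
    """Fraction of random trees in which each pair of samples shares a leaf."""
    rng = random.Random(seed)
    n = len(data)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        leaves = []
        _grow(list(range(n)), data, min_split, rng, leaves)
        for leaf in leaves:
            for i in leaf:
                for j in leaf:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]

# Hypothetical toy data: two well-separated clusters of five 2D points each.
data = [
    (0.0, 0.2), (0.3, 0.1), (0.5, 0.9), (0.8, 0.4), (1.0, 0.7),
    (10.0, 10.3), (10.2, 10.9), (10.6, 10.1), (10.9, 10.5), (11.0, 10.8),
]
cluster = [0] * 5 + [1] * 5
sim = uet_similarity(data)

pairs = [(i, j) for i in range(len(data)) for j in range(i + 1, len(data))]
same = [sim[i][j] for i, j in pairs if cluster[i] == cluster[j]]
diff = [sim[i][j] for i, j in pairs if cluster[i] != cluster[j]]
mean_within = sum(same) / len(same)
mean_between = sum(diff) / len(diff)
print(f"mean within-cluster similarity:  {mean_within:.3f}")
print(f"mean between-cluster similarity: {mean_between:.3f}")
```

The final comparison mirrors the evaluation described above: when a cluster structure exists, the mean similarity between samples of the same cluster should be clearly higher than between samples of different clusters. Because cuts depend only on the ordering of values along one feature at a time, leaf membership is unaffected by monotone rescaling of a variable when the cut is expressed in rank terms, which is the intuition behind the invariance property mentioned in the text.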
Project title: Innovations Technologiques, Modélisation et Médecine Personnalisée; PI: Faiez Zannad, Univ Lorraine (Inserm-CHU-UL). Value: 14.4 M€ (“SMEC” platform – Simulation, Modélisation, Extraction de Connaissances – coordinated by Capsid and Orpailleur teams for Inria Nancy – Grand Est, with IECL and CHRU Nancy: 860 k€, approx); Duration: 2015–2020. Description: The IT2MP project encompasses four interdisciplinary platforms that support several scientific pôles of the university whose research involves human health. The SMEC platform supports research projects ranging from molecular modeling and dynamical simulation to biological data mining and patient cohort studies.
Project title: Conception d’Inhibiteurs du Transfert de Résistances aux agents Anti-Microbiens: bio-ingénierie assistée par des approches virtuelles et numériques, et appliquée à une relaxase d’élément conjugatif intégratif; PI: N. Leblond, Univ Lorraine (DynAMic, UMR 1128); Other partners: Chris Chipot, CNRS (SRSMSC, UMR 7565); Value: 200 k€ (Capsid: 80 k€); Duration: 2017–2018. Description: This project follows on from the 2016 PEPS project “MODEL-ICE”. The aim is to investigate the protein-protein interactions required for initiating the transfer of an ICE (Integrative and Conjugative Element) from one bacterial cell to another, and to develop small-molecule inhibitors of these interactions.
Project title: Criblage virtuel et dynamique moléculaire pour la recherche de bio-actifs ciblant la
Project title: TBA Duration: 2017–2018. Description: TBA
Project title: Identification et modélisation des interactions nécessaires à l'activité du long ARN non-codant ANRIL dans la régulation épigénétique des gènes; PI: Sylvain Maenner, Univ Lorraine (IMoPA, UMR 7365); Value: 20 k€; Duration: 2017–2018. Description: ANRIL is a long non-coding RNA (lncRNA) which has been identified as an important factor in the susceptibility to cardiovascular diseases. ANRIL is involved in the epigenetic regulation of the expression of a network of genes via mechanisms that are still largely unknown. This project aims to identify and model the protein-RNA and/or DNA-RNA interactions that ANRIL establishes within the eukaryotic genome.
GlycoEst is an informal working group which was recently created to develop an interdisciplinary regional network of Glyco-scientists. Isaure Chauvot de Beauchêne gave a talk on her protein-GAG docking method at the first meeting of this group in March 2018.
Project title: Structural bioinformatics server; PI: David Ritchie, Capsid (Inria Nancy – Grand Est); Value: 24 k€; Duration: 2015–2020. Description: This funding provides a small high performance computing server for structural bioinformatics research at the Inria Nancy – Grand Est centre.
Project title: Combattre l’insuffisance cardiaque; PI: Patrick Rossignol, Univ Lorraine (FHU-Cartage); Partners: multiple; Value: 9 M€ (Capsid and Orpailleur: 450 k€, approx); Duration: 2015–2019. Description: This “Investissements d'Avenir” project aims to discover novel mechanisms for heart failure and to propose decision support for precision medicine. The project has been granted 9 M€, and involves many participants from Nancy University Hospital's Federation “CARTAGE” (http://
Project title: Institut Français de Bioinformatique; PI: Claudine Médigue and Jacques van Helden (CNRS UMS 3601); Partners: multiple; Value: 20 M€ (Capsid: 126 k€); Duration: 2014–2021. Description: The Capsid team is a research node of the IFB (Institut Français de Bioinformatique), the French national network of bioinformatics platforms (http://
EBI: European Bioinformatics Institute, Maria Martin team (UK). We are working with the EBI team to validate and improve our graph-based approaches for protein function annotation.
Project title: Oligo-RNA Combinatorial Assembly for 3D modeling of protein-RNA complexes. PI: Isaure Chauvot de Beauchêne. Value: 8 k€. Description: The project aimed to improve our fragment-based ssRNA docking method, for which we had already provided a proof of principle. It mainly funded two internship students to work on (i) a new ssRNA-protein scoring function and (ii) docking with constraints specific to the geometry of ssRNA in RNA loops.
Project title: Analyzing big data with temporal graphs and machine learning: application to urban traffic analysis and protein function annotation. PI: Sabeur Aridhi; Partners: LORIA/Inria NGE, Federal University of Ceará (UFC); Value: 20 k€; Duration: 2017–2020. Description: This project aims to investigate and propose solutions for both urban traffic-related problems and protein annotation problems. For urban traffic analysis, problems such as traffic speed prediction, travel time prediction, traffic congestion identification, and nearest neighbor identification will be tackled. For protein annotation, protein graphs and/or protein–protein interaction (PPI) networks will be modeled using dynamic time-dependent graph representations.
Participant: David Ritchie; Project: Integrative Modeling of 3D Protein Structures and Interactions; Partner: Rocasolano Institute of Physical Chemistry, Spain. Funding: Inria Nancy – Grand Est (“Nancy Emerging Associate Team”).
Participant: Bernard Maigret; Project: Characterization, expression and molecular modeling of TRR1 and ALS3 proteins of Candida spp., as a strategy to obtain new drugs with action on yeasts involved in nosocomial infections; Partner: State University of Maringá, Brazil.
Participant: Bernard Maigret; Project: Fusarium graminearum target selection; Partner: Embrapa Recursos Geneticos e Biotecnologia, Brazil.
Participant: Bernard Maigret; Project: The thermal shock HSP90 protein as a target for new drugs against paracoccidioidomycosis; Partner: Brasília University, Brazil.
Participant: Bernard Maigret; Project: Protein-protein interactions for the development of new drugs; Partner: Federal University of Goias, Brazil.
Ghania Khensous from the University of Sciences and Technologies in Oman visited the team to develop a tabu-based search algorithm for flexible protein-ligand docking, under the supervision of Bernard Maigret.
Patricia Alves from the University of Brasilia is visiting the team to carry out drug repositioning on several target fungus proteins under the supervision of Bernard Maigret.
Agnibha Chandra from the Indian Institute of Engineering Science & Technology visited the team to optimize a force-field for ssRNA-protein docking, under the supervision of Isaure Chauvot de Beauchêne.
Ismail El Fadli from the Mohammed V University of Rabat visited the team to adapt our RNA-protein docking method to DNA-protein systems, under the supervision of Isaure Chauvot de Beauchêne.
Aichata Niang from the University of Paris Diderot visited the team to apply modeling and virtual screening to a galactosyl-transferase enzyme in order to find new inhibitors, under the supervision of Isaure Chauvot de Beauchêne.
Rohit Roy from the Indian Institute of Technology at Kharagpur visited the team in order to include geometric constraints in our docking algorithm for docking RNA loops, under the supervision of Isaure Chauvot de Beauchêne.
Giammarco Mastronardi from the University of Lorraine visited the team to carry out a virtual screening study of small-molecule inhibitors of a bacterial polyketide synthase module, under the supervision of Bernard Maigret and David Ritchie.
Wissem Inoubli from the University of Tunis El Manar visited the team to work on his PhD thesis on distributed graph processing under the supervision of Sabeur Aridhi.
Damien Vantourout from the University of Lorraine visited the team to develop a tool for protein function annotation using semantic protein networks and deep neural networks under the supervision of Sabeur Aridhi.
Xavier Farchetto from the University of Lorraine visited the team to work on the segmentation of images in Crohn’s disease under the supervision of Malika Smaïl-Tabbone and Chedy Raissy (Orpailleur team).
Maxime Guyot from the University of Lorraine (Telecom Nancy, second-year internship) visited the team to perform statistical analyses on KBDock and to develop a scoring function for protein-protein interaction subgraphs extracted from a knowledge graph database.
Marie-Dominique Devignes is a member of the Steering Committee for the European Conference on Computational Biology (ECCB; http://
Malika Smaïl-Tabbone is a member of the steering committee of the Conférence Francophone sur la Recherche d’Information et ses Applications (CORIA).
Malika Smaïl-Tabbone is a member of the organizing committee for Inforsid 2018 and EGC 2018, and is in charge of the é-EGC winter school in Metz.
Marie-Dominique Devignes was a reviewer for IWBBIO and BIBM.
David Ritchie is a member of the editorial board of Scientific Reports.
Sabeur Aridhi is a member of the editorial board of Intelligent Data Analysis.
The members of the team have reviewed manuscripts for Bioinformatics, BMC Bioinformatics, Computational and Structural Biotechnology Journal, Computational Biology and Chemistry, Evolutionary Bioinformatics, Journal of Computational Chemistry, Journal of Chemical Information and Modeling, Journal of Molecular Graphics and Modeling, Nucleic Acids Research, PLoS One, Proteins: Structure, Function & Bioinformatics, and Structure.
Marie-Dominique Devignes gave a presentation on accelerating precision medicine to the OLINK proteomics society in Uppsala, Sweden.
Malika Smaïl-Tabbone gave a talk on “Integrative Machine Learning Applied on Drug Side Effect Profiles” at WCP2018 (18th World Congress of Basic and Clinical Pharmacology) in Kyoto, Japan.
Isaure Chauvot de Beauchêne gave a presentation on docking by combinatorial assembly of fragments to the Centre de mathématiques et de leurs applications (CMLA) at ENS Cachan.
Marie-Dominique Devignes reviewed grant applications for the ANR programme “Appel générique-JCJC”.
Malika Smaïl-Tabbone reviewed grant applications for the ANR and for the Indo-French Centre for the Promotion of Advanced Research.
Sabeur Aridhi reviewed grant applications for the French Committee for the Evaluation of Academic and Scientific Cooperation with Brazil (COFECUB).
David Ritchie is a member of the Bureau of the GGMM (Groupe de Graphisme et Modélisation Moléculaire).
Isaure Chauvot de Beauchêne is the team's representative to the ELIXIR 3D-Bioinfo working group, which is an essential link between the national IFB and the European ELIXIR projects that aim to make bioinformatics software and platforms available to non-expert biologists.
Marie-Dominique Devignes is Chargée de Mission for the CyberBioHealth research axis at the LORIA and is a member of the “Comipers” recruitment committee for Inria Nancy – Grand Est.
David Ritchie is a member of the Commission de Mention Informatique (CMI) of the University of Lorraine's IAEM doctoral school. Until November 2018 he was a member of the Bureau of the Project Committee for Inria Nancy – Grand Est.
Sabeur Aridhi is responsible for the major in IAMD (Ingénierie et Applications des Masses de Données) at TELECOM Nancy (Univ. Lorraine), and a member of the “Commission du Développement Technologique” recruitment committee at Inria Nancy – Grand Est.
Licence: Sabeur Aridhi, Programming Techniques and Tools, 24 hours, L1, Univ Lorraine.
Licence: Sabeur Aridhi, Databases, 82 hours, L1, Univ Lorraine.
Licence: Sabeur Aridhi, Massive Data Management, 68 hours, L2, Univ Lorraine.
Licence: Sabeur Aridhi, NoSQL Databases, 44 hours, L2, Univ Lorraine.
Licence: Sabeur Aridhi, Big Data Hackathon, 8 hours, L3, Univ Lorraine.
Licence: Marie-Dominique Devignes, Relational Database Design and SQL, 30 hours, L3, Univ Lorraine.
Licence: Isaure Chauvot de Beauchêne, TD Bioinformatique et Modelisation, 10 hours, L3, Univ Lorraine.
Licence: Malika Smaïl-Tabbone, Relational Databases, 90 hours, L2, L3, Univ Lorraine.
Licence: Malika Smaïl-Tabbone, NoSQL Databases, 30 hours, M1, Univ Lorraine.
Licence: Malika Smaïl-Tabbone, Programming Techniques, 30 hours, L2, Univ Lorraine.
Master: Malika Smaïl-Tabbone, KDD and Data Mining Algorithms, 90 hours, M2, Univ Lorraine.
Master: Malika Smaïl-Tabbone, Databases: Concepts and Techniques, 30 hours, M2, Univ Lorraine.
Master: Malika Smaïl-Tabbone, Ontology Management and Semantic Web Technologies, 30 hours, M2, Univ Lorraine.
Master: Sabeur Aridhi, Knowledge Discovery and Data Engineering, 10 hours, M2, Univ Lorraine.
PhD in progress: Maria Elisa Ruiz Echartea, Multi-component protein assembly using distance constraints, 01/11/2016, David Ritchie, Isaure Chauvot de Beauchêne.
PhD in progress: Kévin Dalleau, Complex graph analysis for classification: application to disease nosography, 01/12/2016, Malika Smaïl-Tabbone, Miguel Couceiro.
PhD in progress: Bishnu Sarker, Developing distributed graph-based approaches for large-scale protein function annotation and knowledge discovery, 01/11/2017, David Ritchie, Sabeur Aridhi.
PhD in progress: Antoine Moniot, Modeling protein / nucleic acid complexes by combinatorial structural fragment assembly, 01/11/2017, David Ritchie, Isaure Chauvot de Beauchêne.
PhD in progress: Athénais Vaginay, Model selection and analysis for biological networks: use of domain knowledge and application to networks disturbed in diseases, 01/11/2017, Taha Boukhobza, Malika Smaïl-Tabbone.