<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN" "http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <title>Project-Team:CAPSID</title>
    <link rel="stylesheet" href="../static/css/raweb.css" type="text/css"/>
    <meta name="description" content="Research Program - Classifying and Mining Protein Structures and Protein Interactions"/>
    <meta name="dc.title" content="Research Program - Classifying and Mining Protein Structures and Protein Interactions"/>
    <meta name="dc.subject" content=""/>
    <meta name="dc.publisher" content="INRIA"/>
    <meta name="dc.date" content="(SCHEME=ISO8601) 2017-01"/>
    <meta name="dc.type" content="Report"/>
    <meta name="dc.language" content="(SCHEME=ISO639-1) en"/>
    <meta name="projet" content="CAPSID"/>
    <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
      <!--MathJax-->
    </script>
  </head>
  <body>
    <div class="tdmdiv">
      <div class="logo">
        <a href="http://www.inria.fr">
          <img style="align:bottom; border:none" src="../static/img/icons/logo_INRIA-coul.jpg" alt="Inria"/>
        </a>
      </div>
      <div class="TdmEntry">
        <div class="tdmentete">
          <a href="uid0.html">Project-Team Capsid</a>
        </div>
        <span>
          <a href="uid1.html">Personnel</a>
        </span>
      </div>
      <div class="TdmEntry">Overall Objectives<ul><li><a href="./uid3.html">Computational Challenges in Structural Biology</a></li></ul></div>
      <div class="TdmEntry">Research Program<ul><li class="tdmActPage"><a href="uid9.html&#10;&#9;&#9;  ">Classifying and Mining Protein Structures and
Protein Interactions</a></li><li><a href="uid14.html&#10;&#9;&#9;  ">Integrative Multi-Component Assembly and Modeling</a></li></ul></div>
      <div class="TdmEntry">Application Domains<ul><li><a href="uid24.html&#10;&#9;&#9;  ">Biomedical Knowledge Discovery</a></li><li><a href="uid25.html&#10;&#9;&#9;  ">Prokaryotic Type IV Secretion Systems</a></li><li><a href="uid26.html&#10;&#9;&#9;  ">Protein-RNA Interactions</a></li></ul></div>
      <div class="TdmEntry">
        <a href="./uid28.html">Highlights of the Year</a>
      </div>
      <div class="TdmEntry">New Software and Platforms<ul><li><a href="uid30.html&#10;&#9;&#9;  ">Hex</a></li><li><a href="uid34.html&#10;&#9;&#9;  ">Kbdock</a></li><li><a href="uid38.html&#10;&#9;&#9;  ">Kpax</a></li><li><a href="uid41.html&#10;&#9;&#9;  ">Sam</a></li><li><a href="uid46.html&#10;&#9;&#9;  ">gEMfitter</a></li><li><a href="uid50.html&#10;&#9;&#9;  ">ECDM</a></li><li><a href="uid53.html&#10;&#9;&#9;  ">GODM</a></li><li><a href="uid56.html&#10;&#9;&#9;  ">BLADYG</a></li><li><a href="uid59.html&#10;&#9;&#9;  ">Platforms</a></li></ul></div>
      <div class="TdmEntry">New Results<ul><li><a href="uid63.html&#10;&#9;&#9;  ">Drug Targeting and Adverse Drug Side Effects</a></li><li><a href="uid64.html&#10;&#9;&#9;  ">Docking Symmetrical Protein Structures</a></li><li><a href="uid65.html&#10;&#9;&#9;  ">Multiple Flexible Protein Structure Alignments</a></li><li><a href="uid66.html&#10;&#9;&#9;  ">Large-Scale Annotation of Protein Domains and Sequences</a></li><li><a href="uid67.html&#10;&#9;&#9;  ">Distributed Protein Graph Processing</a></li></ul></div>
      <div class="TdmEntry">Partnerships and Cooperations<ul><li><a href="uid69.html&#10;&#9;&#9;  ">Regional Initiatives</a></li><li><a href="uid74.html&#10;&#9;&#9;  ">National Initiatives</a></li><li><a href="uid79.html&#10;&#9;&#9;  ">International Initiatives</a></li><li><a href="uid86.html&#10;&#9;&#9;  ">International Research Visitors</a></li></ul></div>
      <div class="TdmEntry">Dissemination<ul><li><a href="uid90.html&#10;&#9;&#9;  ">Promoting Scientific Activities</a></li><li><a href="uid103.html&#10;&#9;&#9;  ">Teaching - Supervision - Juries</a></li><li><a href="uid119.html&#10;&#9;&#9;  ">Popularization</a></li></ul></div>
      <div class="TdmEntry">
        <div>Bibliography</div>
      </div>
      <div class="TdmEntry">
        <ul>
          <li>
            <a id="tdmbibentmajor" href="bibliography.html">Major publications</a>
          </li>
          <li>
            <a id="tdmbibentyear" href="bibliography.html#year">Publications of the year</a>
          </li>
          <li>
            <a id="tdmbibentfoot" href="bibliography.html#References">References in notes</a>
          </li>
        </ul>
      </div>
    </div>
    <div id="main">
      <div class="mainentete">
        <div id="head_agauche">
          <small><a href="http://www.inria.fr">
	    
	    Inria
	  </a> | <a href="../index.html">
	    
	    Raweb 
	    2017</a> | <a href="http://www.inria.fr/en/teams/capsid">Presentation of the Project-Team CAPSID</a> | <a href="http://capsid.loria.fr/">CAPSID Web Site
	  </a></small>
        </div>
        <div id="head_adroite">
          <table class="qrcode">
            <tr>
              <td>
                <a href="capsid.xml">
                  <img style="align:bottom; border:none" alt="XML" src="../static/img/icons/xml_motif.png"/>
                </a>
              </td>
              <td>
                <a href="capsid.pdf">
                  <img style="align:bottom; border:none" alt="PDF" src="IMG/qrcode-capsid-pdf.png"/>
                </a>
              </td>
              <td>
                <a href="../capsid/capsid.epub">
                  <img style="align:bottom; border:none" alt="e-pub" src="IMG/qrcode-capsid-epub.png"/>
                </a>
              </td>
            </tr>
            <tr>
              <td/>
              <td>PDF
</td>
              <td>e-Pub
</td>
            </tr>
          </table>
        </div>
      </div>
      <!--FIN du corps du module-->
      <br/>
      <div class="bottomNavigation">
        <div class="tail_aucentre">
          <a href="./uid3.html" accesskey="P"><img style="align:bottom; border:none" alt="previous" src="../static/img/icons/previous_motif.jpg"/> Previous | </a>
          <a href="./uid0.html" accesskey="U"><img style="align:bottom; border:none" alt="up" src="../static/img/icons/up_motif.jpg"/>  Home</a>
          <a href="./uid14.html" accesskey="N"> | Next <img style="align:bottom; border:none" alt="next" src="../static/img/icons/next_motif.jpg"/></a>
        </div>
        <br/>
      </div>
      <div id="textepage">
        <!--DEBUT2 du corps du module-->
        <h2>Section: 
      Research Program</h2>
        <h3 class="titre3">Classifying and Mining Protein Structures and
Protein Interactions</h3>
        <a name="uid10"/>
        <h4 class="titre4">Context</h4>
        <p>The scientific discovery process is very often based on cycles of
measurement, classification, and generalisation.
It is easy to argue that this is especially true in the biological sciences.
The proteins that exist today represent the molecular
product of some three billion years of evolution. Therefore, comparing protein
sequences and structures is important for understanding their functional and
evolutionary relationships <a href="./bibliography.html#capsid-2017-bid2">[66]</a>, <a href="./bibliography.html#capsid-2017-bid3">[41]</a>.
There is now overwhelming evidence that all living organisms
and many biological processes share a common ancestry in the tree of life.
Historically, much of bioinformatics research has focused on developing mathematical
and statistical algorithms to process, analyse, annotate, and compare protein and DNA
sequences because such sequences represent the primary form of information in
biological systems.
However, there is growing evidence that structure-based
methods can help to predict networks of protein-protein interactions (PPIs)
with greater accuracy than
those which do not use structural evidence <a href="./bibliography.html#capsid-2017-bid4">[45]</a>, <a href="./bibliography.html#capsid-2017-bid5">[71]</a>.
Therefore, developing techniques which can
mine knowledge of protein structures and their interactions
is an important way to enhance our knowledge of biology <a href="./bibliography.html#capsid-2017-bid6">[30]</a>.</p>
        <a name="uid11"/>
        <h4 class="titre4">Quantifying Structural Similarity</h4>
        <p>Often, proteins may be divided into modular sub-units called domains,
which can be associated with specific biological functions.
Thus, a protein domain may be considered as the evolutionary unit of biological
structure and function <a href="./bibliography.html#capsid-2017-bid7">[70]</a>.
However, while it is well known
that the 3D structures of protein domains are often more
evolutionarily conserved than their one-dimensional (1D) amino acid sequences,
comparing 3D structures is much more difficult than comparing 1D sequences.
Consequently, until recently, most evolutionary studies of proteins have compared and clustered
1D amino acid and nucleotide sequences rather than 3D molecular structures.</p>
        <p>A pre-requisite for the accurate comparison of protein structures
is to have a reliable method for quantifying the
structural similarity between pairs of proteins.
We recently developed a new protein structure alignment program
called Kpax
which combines an efficient dynamic programming based
scoring function with a simple but novel Gaussian representation of
protein backbone shape
<a href="./bibliography.html#capsid-2017-bid8">[59]</a>.
This means that we can now quantitatively compare 3D protein domains at a
throughput similar to that of conventional protein sequence comparison algorithms.
We recently compared Kpax with a large number of other structure
alignment programs in a CATH family recognition test,
and found it to be the fastest and amongst the most accurate
<a href="./bibliography.html#capsid-2017-bid9">[48]</a>.
The latest version of Kpax <a href="./bibliography.html#capsid-2017-bid10">[9]</a> can calculate multiple flexible
alignments, and thus promises to handle comparisons of more distantly
related protein folds and fold families more reliably.</p>
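        <p>To illustrate the general idea of combining a Gaussian similarity score with dynamic
programming, the following Python sketch aligns two chains described by per-residue descriptors.
It is not the Kpax scoring function: the descriptor tuples, the <code>sigma</code> and <code>gap</code>
parameters, and the function names are assumptions made for this example only.</p>
<pre>
```python
import math


def gaussian_similarity(u, v, sigma=1.0):
    """Gaussian kernel on the squared Euclidean distance between two descriptors."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def align_score(xs, ys, gap=0.2, sigma=1.0):
    """Global alignment score by dynamic programming (Needleman-Wunsch style).

    xs, ys: lists of per-residue descriptor tuples (hypothetical local shape features).
    Matching a pair earns its Gaussian similarity in [0, 1]; each gap costs `gap`.
    """
    n, m = len(xs), len(ys)
    # best[i][j] = best score aligning the first i residues of xs with the first j of ys
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + gaussian_similarity(xs[i - 1], ys[j - 1], sigma)
            best[i][j] = max(match, best[i - 1][j] - gap, best[i][j - 1] - gap)
    return best[n][m]
```
</pre>
        <p>In this sketch each table cell depends only on its three neighbours, so the cost grows with
the product of the two chain lengths, as it does for conventional sequence alignment by dynamic programming.</p>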
        <a name="uid12"/>
        <h4 class="titre4">Formalising and Exploiting Domain Knowledge</h4>
        <p>Concerning protein structure classification, we aim to
explore novel classification paradigms to circumvent the problems encountered
with existing hierarchical classifications of protein folds and domains.
In particular, it will be interesting to set up fuzzy clustering methods
taking advantage of our previous work on gene functional classification
<a href="./bibliography.html#capsid-2017-bid11">[36]</a>,
but instead using Kpax domain-domain similarity matrices.
Two non-trivial issues with fuzzy clustering are how to handle similarity matrices rather
than mathematical distance matrices,
and how to find the optimal number of clusters,
especially when using a non-Euclidean similarity measure.
We will adapt the algorithms and the calculation of quality indices to the
Kpax similarity measure.
More fundamentally, it will be necessary to integrate this classification
step in the more general process leading from data to knowledge
called Knowledge Discovery in Databases (KDD)
<a href="./bibliography.html#capsid-2017-bid12">[39]</a>.</p>
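        <p>One possible way to cluster directly from a similarity matrix is a fuzzy c-medoids scheme,
sketched below in Python. This is an illustration under stated assumptions rather than the method
we will adopt: similarities are assumed to be symmetric and scaled to [0, 1], they are converted to
dissimilarities as 1 - s, and all function names are hypothetical. The partition coefficient is
included as one example of a quality index.</p>
<pre>
```python
def memberships(sim, medoids, m=2.0, eps=1e-9):
    """Fuzzy membership of every object in every cluster.

    sim: symmetric similarity matrix with values in [0, 1] (assumed normalised scores).
    medoids: indices of the objects currently representing the clusters.
    Similarity is turned into a dissimilarity d = 1 - s, so no metric embedding is needed.
    """
    power = 1.0 / (m - 1.0)
    rows = []
    for i in range(len(sim)):
        weights = [(1.0 / (1.0 - sim[i][c] + eps)) ** power for c in medoids]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def update_medoids(sim, u, m=2.0):
    """For each cluster, pick the object with the lowest membership-weighted dissimilarity."""
    n, k = len(sim), len(u[0])

    def cost(c, j):
        # Weighted dissimilarity of all objects to candidate medoid c for cluster j.
        return sum((u[i][j] ** m) * (1.0 - sim[i][c]) for i in range(n))

    return [min(range(n), key=lambda c, j=j: cost(c, j)) for j in range(k)]


def fuzzy_c_medoids(sim, medoids, m=2.0, max_iter=50):
    """Alternate membership and medoid updates until the medoids stop changing."""
    for _ in range(max_iter):
        u = memberships(sim, medoids, m)
        new = update_medoids(sim, u, m)
        if new == medoids:
            break
        medoids = new
    return medoids, memberships(sim, medoids, m)


def partition_coefficient(u):
    """Quality index in [1/k, 1]; higher values indicate crisper clusters."""
    return sum(x * x for row in u for x in row) / len(u)
```
</pre>
        <p>Running the procedure for several numbers of clusters and comparing a quality index such as
the partition coefficient is one simple heuristic for choosing the number of clusters; other indices
may behave differently with a non-Euclidean similarity measure.</p>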
        <p>Another example where domain knowledge can be useful is
during result interpretation: several sources of knowledge have to be used
to explicitly characterise each cluster and to help decide its validity.
Thus, it will be useful to be able to express data models, patterns, and rules in
a common formalism using a defined vocabulary for concepts and relationships.
Existing approaches such as the Molecular Interaction (MI) format <a href="./bibliography.html#capsid-2017-bid13">[42]</a>
developed by the Human Proteome Organization (HUPO) mostly address the experimental
wet lab aspects leading to data production and curation <a href="./bibliography.html#capsid-2017-bid14">[53]</a>.
A different point of view is represented in the Interaction Network Ontology
(INO),
a community-driven ontology that aims to standardise and integrate data on interaction
networks and to support computer-assisted
reasoning  <a href="./bibliography.html#capsid-2017-bid15">[73]</a>.
However, this ontology does not integrate basic 3D concepts and structural relationships.
Therefore, extending such formalisms and symbolic relationships will be beneficial,
if not essential, when classifying the 3D shapes of proteins at the domain family level.</p>
        <a name="uid13"/>
        <h4 class="titre4">3D Protein Domain Annotation and Shape Mining</h4>
        <p>A widely used collection of protein domain families is “Pfam”
<a href="./bibliography.html#capsid-2017-bid16">[38]</a>, constructed from multiple alignments of protein sequences.
Integrating domain-domain similarity measures with knowledge about domain
binding sites, as introduced by us in our KBDOCK approach
<a href="./bibliography.html#capsid-2017-bid17">[1]</a>, <a href="./bibliography.html#capsid-2017-bid18">[3]</a>,
can help in selecting interesting subsets of domain pairs before clustering.
Thanks to our KBDOCK and Kpax projects,
we already have a rich set of tools with which we can start to process and
compare all known protein structures and PPIs according to their
component Pfam domains.
Linking this new classification to the latest “SIFTS”
(Structure Integration with Function, Taxonomy and Sequence) <a href="./bibliography.html#capsid-2017-bid19">[67]</a>
functional annotations between standard UniProt (<a href="http://www.uniprot.org/">http://www.uniprot.org/</a>)
sequence identifiers and protein structures from the
Protein Data Bank (PDB) <a href="./bibliography.html#capsid-2017-bid20">[29]</a>
could then provide a useful way to discover new structural and functional
relationships which are difficult to detect in existing classification
schemes such as CATH or SCOP.
As part of the thesis project of Seyed Alborzi, we developed
a recommender-based data mining technique to associate Enzyme Commission (EC) numbers
with Pfam domains using our EC-DomainMiner program
<a href="./bibliography.html#capsid-2017-bid21">[11]</a>.
We subsequently generalised this approach as a tripartite graph mining method for inferring
associations between different protein annotation sources,
which we call “CODAC” (for COmputational Discovery of Direct Associations using Common Neighbours).
A first paper on this approach was presented at IWBBIO-2017 <a href="./bibliography.html#capsid-2017-bid22">[23]</a>.</p>
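        <p>The common-neighbour idea behind this kind of association mining can be sketched as follows.
The Python example below is a simplified illustration, not the EC-DomainMiner or CODAC scoring scheme:
it links domains to annotation labels through the proteins they share and normalises the count with a
cosine-style factor. The function name, the normalisation, and the toy identifiers are assumptions.</p>
<pre>
```python
from collections import defaultdict
from math import sqrt


def common_neighbour_scores(protein_domains, protein_labels):
    """Score (domain, label) pairs by the proteins they share.

    protein_domains: dict protein -> set of domain identifiers.
    protein_labels:  dict protein -> set of annotation labels (for example EC numbers).
    Returns dict (domain, label) -> cosine-normalised count of common neighbours.
    """
    by_domain = defaultdict(set)
    by_label = defaultdict(set)
    for protein, domains in protein_domains.items():
        for d in domains:
            by_domain[d].add(protein)
    for protein, labels in protein_labels.items():
        for lab in labels:
            by_label[lab].add(protein)
    scores = {}
    for d, d_prots in by_domain.items():
        for lab, l_prots in by_label.items():
            shared = len(d_prots.intersection(l_prots))
            if shared:
                scores[(d, lab)] = shared / sqrt(len(d_prots) * len(l_prots))
    return scores


# Toy input with made-up identifiers (not real Pfam or EC entries).
toy_domains = {"P1": {"DomA"}, "P2": {"DomA", "DomB"}, "P3": {"DomB"}}
toy_labels = {"P1": {"EC_1"}, "P2": {"EC_1"}, "P3": {"EC_2"}}
toy_scores = common_neighbour_scores(toy_domains, toy_labels)
```
</pre>
        <p>In this toy input, DomA and EC_1 occur on exactly the same two proteins and so receive the
maximum score of 1, while pairs with no shared protein are not reported.</p>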
      </div>
      <!--FIN du corps du module-->
      <br/>
      <div class="bottomNavigation">
        <div class="tail_aucentre">
          <a href="./uid3.html" accesskey="P"><img style="align:bottom; border:none" alt="previous" src="../static/img/icons/previous_motif.jpg"/> Previous | </a>
          <a href="./uid0.html" accesskey="U"><img style="align:bottom; border:none" alt="up" src="../static/img/icons/up_motif.jpg"/>  Home</a>
          <a href="./uid14.html" accesskey="N"> | Next <img style="align:bottom; border:none" alt="next" src="../static/img/icons/next_motif.jpg"/></a>
        </div>
        <br/>
      </div>
    </div>
  </body>
</html>
