Section: New Results

Protein Shape Matching and Family Identification

Using Dominances for Solving the Protein Family Identification Problem

Participant: Noël Malod-Dognin.

In collaboration with R. Andonov (IRISA), M. Le Boudic-Jamin (IRISA) and P. Kamath (former summer intern within the Symbiose project at IRISA).

The 3D structure of macro-molecules underpins all biological functions. Similarities between protein structures may come from evolutionary relationships, and similar protein structures relate to similar functions.

The exponential growth in the number of known protein structures in the Protein Data Bank over the past decade has led to the problem of protein classification: how to automatically insert new protein structures into an already existing classified database $\mathcal{Q} = \{q_1, q_2, \ldots, q_m\}$ such as CATH or SCOP. The problem of determining to which classes the new structures $\mathcal{P} = \{p_1, p_2, \ldots, p_n\}$ belong, according to a similarity function $S: \mathcal{Q} \times \mathcal{P} \rightarrow \mathbb{R}^{+}$, is referred to here as the Protein Family Identification Problem (FIP).

The FIP has computational pitfalls. The number of similarity scores $S(q_i, p_j)$ that need to be computed is $|\mathcal{Q}| \times |\mathcal{P}|$, where $|\mathcal{Q}|$ can be very large (there are currently 152920 classified protein structures in the expert classification CATH). Moreover, computing a single similarity score is often equivalent to solving an NP-hard problem (e.g., DALI, DAST, CMO, VAST).
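The cost described above can be made concrete with a minimal brute-force sketch that performs one similarity evaluation per (classified structure, new structure) pair. It assumes, for illustration only, that each new structure is assigned to the family of its highest-scoring classified structure. The function names, the string encoding of structures and the toy similarity are hypothetical stand-ins and are not taken from the cited work.

```python
from typing import Callable, Dict, List, Tuple


def toy_similarity(q: str, p: str) -> float:
    """Toy stand-in for an expensive structural score (assumption):
    counts positions where two equal-length strings agree."""
    return float(sum(a == b for a, b in zip(q, p)))


def brute_force_fip(
    classified: List[Tuple[str, str]],      # (structure, family label) pairs
    new_structures: List[str],
    similarity: Callable[[str, str], float],
) -> Tuple[Dict[str, str], int]:
    """Assign each new structure to the family of its best-scoring
    classified structure, and count the similarity evaluations made."""
    assignments: Dict[str, str] = {}
    n_calls = 0
    for p in new_structures:
        best_score, best_family = float("-inf"), ""
        for q, family in classified:
            score = similarity(q, p)        # in practice, possibly NP-hard
            n_calls += 1
            if score > best_score:
                best_score, best_family = score, family
        assignments[p] = best_family
    return assignments, n_calls


if __name__ == "__main__":
    classified = [("HHHH", "alpha"), ("EEEE", "beta"), ("HHEE", "alpha-beta")]
    new_structures = ["HHHC", "EEEC"]
    assignments, n_calls = brute_force_fip(classified, new_structures, toy_similarity)
    print(assignments, n_calls)
```

With three classified structures and two new ones, the sketch makes 3 × 2 = 6 similarity evaluations. Each of these would be a separate hard optimization problem if a real structural score were used.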

In [17] and [18], we propose a notion of dominance between protein structure comparison instances that allows the FIP to be solved optimally without solving every comparison instance to optimality, thus reducing the impact of the NP-hardness of the similarity score.
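One way to picture the dominance idea is the sketch below. It is not the formulation of [17] and [18]. It assumes that each comparison instance exposes a lower and an upper bound on its optimal similarity score, and that one instance dominates another when its lower bound exceeds the other's upper bound. The `Instance` class, its `refine` method (a toy stand-in for one round of an exact solver) and the hidden `optimum` field are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    family: str
    lower: float    # best feasible score found so far (assumed available)
    upper: float    # proven upper bound on the optimal score (assumed available)
    optimum: float  # hidden true optimum, used only by the toy refine()

    def refine(self, step: float) -> None:
        """Toy solver round: move both bounds towards the optimum."""
        self.lower = min(self.optimum, self.lower + step)
        self.upper = max(self.optimum, self.upper - step)


def identify_family(instances: List[Instance], step: float = 1.0) -> Tuple[str, int]:
    """Find the best-scoring family for one new structure, discarding
    dominated instances instead of solving them to optimality."""
    alive = list(instances)
    refinements = 0
    while True:
        best_lower = max(inst.lower for inst in alive)
        # An instance whose upper bound is below the best lower bound is
        # dominated: it cannot be the best match, so it is dropped.
        alive = [inst for inst in alive if inst.upper >= best_lower]
        open_instances = [inst for inst in alive if inst.upper > inst.lower]
        if len(alive) == 1 or not open_instances:
            break
        # Spend solver effort on the most promising unsolved instance.
        target = max(open_instances, key=lambda inst: inst.upper)
        target.refine(step)
        refinements += 1
    winner = max(alive, key=lambda inst: inst.lower)
    return winner.family, refinements


if __name__ == "__main__":
    candidates = [
        Instance("alpha", lower=8, upper=14, optimum=10),
        Instance("beta", lower=2, upper=7, optimum=5),
        Instance("alpha-beta", lower=5, upper=9, optimum=6),
    ]
    print(identify_family(candidates))
```

In this toy run the "beta" instance is discarded before any refinement, and "alpha-beta" is discarded after two refinements of "alpha". The winning family is identified while the bounds of "alpha" are still apart, so no instance is solved to optimality.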