One of the key challenges in the study of biological systems is understanding how the static information recorded in the genome is interpreted to become dynamic systems of cooperating and competing biomolecules. Magnome addresses this challenge through the development of informatic techniques for multi-scale modeling and large-scale comparative genomics:
logical and object models for knowledge representation
stochastic hierarchical models for behavior of complex systems, formal methods
algorithms for sequence analysis, and
data mining and classification.
We use genome-scale comparisons of eukaryotic organisms to build modular and hierarchical hybrid models of cell behavior that are studied using multi-scale stochastic simulation and formal methods. Our research program builds on our experience in comparative genomics, modeling of protein interaction networks, and formal methods for multi-scale modeling of complex systems.
New high-throughput technologies for DNA sequencing have radically reduced the cost of acquiring genome and transcriptome data, and introduced new strategies for whole genome sequencing. The result has been an increase in data volumes of several orders of magnitude, as well as a greatly increased density of genome sequences within phylogenetically constrained groups of species. Magnome develops efficient techniques for dealing with these increased data volumes, and the combinatorial challenges of dense multi-genome comparison.
Fundamental questions in the life sciences can now be addressed at an unprecedented scale through the combination of high-throughput experimental techniques and advanced computational methods from the computer sciences. The new field of computational biology or bioinformatics has grown around intense collaboration between biologists and computer scientists working towards understanding living organisms as systems. One of the key challenges in this study of systems biology is understanding how the static information recorded in the genome is interpreted to become dynamic systems of cooperating and competing biomolecules.
Magnome addresses this challenge through the development of informatic techniques for understanding the structure and history of eukaryote genomes: algorithms for genome analysis, data models for knowledge representation, stochastic hierarchical models for behavior of complex systems, and data mining and classification. Our work is in methods and algorithms for:
Genome annotation for complete genomes, performing syntactic analyses to identify genes, and semantic analyses to map biological meaning to groups of genes.
Integration of heterogeneous data, to build complete knowledge bases for storing and mining information from various sources, and for unambiguously exchanging this information between knowledge bases.
Ancestor reconstruction using optimization techniques, to provide plausible scenarios of the history of genome evolution.
Classification and logical inference, to reliably identify similarities between groups of genetic elements, and infer rules through deduction and induction.
Hierarchical and comparative modeling, to build mathematical models of the behavior of complex biological systems, in particular through combination, reuse, and specialization of existing continuous and discrete models.
The hundred- to thousand-fold decrease in sequencing costs seen in the past few years presents significant challenges for data management and large-scale data mining. Magnome's methods specifically address “scaling out,” where resources are added by installing additional computation nodes, rather than by adding more resources to existing hardware. Scaling out adds capacity and redundancy to the resource, and thus fault tolerance, by enforcing data redundancy between nodes, and by reassigning computations to existing nodes as needed.
The central dogma of evolutionary biology postulates that contemporary genomes evolved from a common ancestral genome, but the large scale study of their evolutionary relationships is frustrated by the unavailability of these ancestral organisms that have long disappeared. However, this common inheritance allows us to discover these relationships through comparison, to identify those traits that are common and those that are novel inventions since the divergence of different lineages.
We develop efficient methodologies and software for associating biological information with complete genome sequences, in the particular case where several phylogenetically-related eukaryote genomes are studied simultaneously.
The methods designed by Magnome for comparative genome annotation, structured genome comparison, and construction of integrated models are applied on a large scale to:
eukaryotes from the hemiascomycete class of yeasts, and to
prokaryotes from the lactic bacteria used in winemaking.
A general goal of systems biology is to acquire a detailed quantitative understanding of the dynamics of living systems. Different formalisms and simulation techniques are currently used to construct numerical representations of biological systems, and a recurring challenge is that hand-tuned, accurate models tend to be so focused in scope that it is difficult to repurpose them. We claim that, instead of modeling individual processes de novo, a sustainable effort in building efficient behavioral models must proceed incrementally. Hierarchical modeling is one way of combining specific models into networks. Effective use of hierarchical models requires both formal definition of the semantics of such composition, and efficient simulation tools for exploring the large space of complex behaviors. We have combined theoretical results from formal methods and practical considerations from modeling applications to define BioRica, a framework in which discrete and continuous models can communicate with clear semantics. Hierarchical models in BioRica can be assembled from existing models, translated into their execution semantics, and then simulated at multiple resolutions through multi-scale stochastic simulation. BioRica models are compiled into a discrete event formalism capable of capturing discrete, continuous, stochastic, nondeterministic, and timed behaviors in an integrated and unambiguous way. Our long-term goal is to develop a methodology in which we can assemble a model for a species of interest using a library of reusable models and an organism-level “schematic” determined by comparative genomics.
Comparative modeling is also a matter of reconciling experimental data with models and inferring new models through a combination of comparative genomics and successive refinement.
Yeasts provide an ideal subject matter for the study of eukaryotic microorganisms. From an experimental standpoint, the yeast Saccharomyces cerevisiae is a model organism amenable to laboratory use and very widely exploited, resulting in an astonishing array of experimental results. From a genomic standpoint, yeasts from the hemiascomycete class provide a unique tool for studying eukaryotic genome evolution on a large scale. With their relatively small and compact genomes, yeasts offer a unique opportunity to explore eukaryotic genome evolution by comparative analysis of several species. MAGNOME applies its methods for comparative genomics and knowledge engineering to the yeasts through the ten-year old Génolevures program (GDR 2354 CNRS), devoted to large-scale comparisons of yeast genomes with the aim of addressing basic questions of molecular evolution.
We developed the software tools used by the CNRS's http://
Oleaginous yeasts are capable of synthesizing lipids from substrates other than glucose, and current research is attempting to understand these conversions with the goal of optimizing their throughput, production and quality. From a genomic standpoint the objective is to characterize genes involved in the biosynthesis of precursor molecules that will be transformed into fuels, which are thus not derived from petroleum. Magnome's focus is on acquiring genome sequences, predicting genes using models learned from genome comparison and sequencing of cDNA transcripts, and comparative annotation. Our overall goal is to define dynamic models that can be used to predict the behavior of modified strains and thus drive selection and genetic engineering.
Yeasts and bacteria are essential for the winemaking process, and selection of strains based both on their efficiency and on the influence on the quality of wine is a subject of significant effort in the Aquitaine region. Unlike the species studied above, yeast and bacterial starters for winemaking cannot be genetically modified. In order to propose improved and more specialized starters, industrial producers use breeding and selection strategies.
Comparative genomics is a powerful tool for strain selection even when genetic engineering must be excluded. Large-scale comparison of the genomes of experimentally characterized strains can be used to identify quantitative trait loci, which can be used as markers in selective breeding strategies. Identifying individual SNPs and predicting their effect can lead to better understanding of the function of genes implicated in improved strain performance, particularly when those genes are naturally mutated or are the result of the transfer of genetic material from other strains. And understanding the combined effect of groups of genes or alleles can lead to insight in the phenomenon of heterosis.
Affinity binders are molecular tools for recognizing protein targets that play a fundamental role in proteomics and clinical diagnostics. Large catalogs of binders from competing technologies (antibodies, DNA/RNA aptamers, artificial scaffolds, etc.) exist, and Europe has set itself the ambitious goal of establishing a comprehensive, characterized and standardized collection of specific binders directed against all individual human proteins, including variant forms and modifications. Despite the central importance of binders, they presently cover only a very small fraction of the proteome, and even though there are many antibodies against some targets (for example,
The MAGUS genome annotation system integrates genome sequences and sequence features, in silico analyses, and views of external data resources into a familiar user interface requiring only a Web browser. MAGUS implements annotation workflows and enforces curation standards to guarantee consistency and integrity. As a novel feature the system provides a workflow for simultaneous annotation of related genomes through the use of protein families identified by in silico analyses; this results in an
Pantograph is a software tool for inferring whole-genome metabolic models for eukaryote cell factories. It is based on metabolic scaffolds, abstract descriptions of reactions and pathways on which inferred reactions are hung and are eventually connected by an iterative mapping and specialization process. Scaffold fragments can be repeatedly used to build specialized subnetworks of the complete model.
A novel feature of Pantograph is that it uses expert knowledge implicitly encoded in the scaffold's gene associations, and explicitly transfers this knowledge to the new model.
Pantograph is available under an open-source license.
For more information see the Pantograph Gforge web site.
The metabolic model generalization and navigation software allows a human expert to explore a metabolic model in a layered manner. The software creates an on-line semantically zoomable representation of a model submitted by the user in SBML
BioRica is a high-level modeling framework integrating discrete and continuous multi-scale dynamics within the same semantic domain. A BioRica model is hierarchically composed of nodes, which may be existing models. Individual nodes can be of two types:
Discrete nodes are composed of states and transitions described by guarded events. Behavior can be stochastic (defined by the likelihood that an event fires when activated) and timed (defined by the delay between an event's activation and the moment that its transition occurs).
Continuous nodes are described by ODE systems, potentially a hybrid system whose internal state flows continuously while having discrete jumps.
The system has been implemented as a distributable software package.
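To make the notion of guarded, stochastic, timed events concrete, the following is a minimal Python sketch of a single discrete node. It is not BioRica syntax and does not reproduce its execution semantics: the `Event` class, the `simulate` function, and the toy gene-switch events in `EVENTS` are hypothetical names chosen for illustration. The scheduler also makes a simplifying assumption: at each step one activated event is drawn according to its weight and then fires after its delay.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]


@dataclass
class Event:
    """A guarded event (hypothetical structure, for illustration only)."""
    name: str
    guard: Callable[[State], bool]    # the event is activated while the guard holds
    action: Callable[[State], None]   # state update applied when the event fires
    weight: float = 1.0               # relative likelihood among activated events
    delay: float = 1.0                # time between activation and transition


def simulate(state: State, events: List[Event], t_end: float,
             seed: int = 0) -> List[Tuple[float, str]]:
    """Run the node until no event is activated or the next firing would pass t_end."""
    rng = random.Random(seed)
    t = 0.0
    trace: List[Tuple[float, str]] = []
    while True:
        enabled = [e for e in events if e.guard(state)]
        if not enabled:
            break
        # Stochastic choice among activated events, proportional to their weights.
        chosen = rng.choices(enabled, weights=[e.weight for e in enabled], k=1)[0]
        if t + chosen.delay > t_end:
            break
        t += chosen.delay             # timed behavior: wait for the event's delay
        chosen.action(state)
        trace.append((t, chosen.name))
    return trace


# Toy gene switch: the gene turns on, produces up to three proteins, turns off,
# and the proteins then degrade one by one.
EVENTS = [
    Event("activate", lambda s: s["on"] == 0 and s["protein"] == 0,
          lambda s: s.update(on=1), delay=2.0),
    Event("produce", lambda s: s["on"] == 1 and s["protein"] < 3,
          lambda s: s.update(protein=s["protein"] + 1), weight=3.0, delay=0.5),
    Event("deactivate", lambda s: s["on"] == 1,
          lambda s: s.update(on=0), weight=1.0, delay=1.0),
    Event("degrade", lambda s: s["on"] == 0 and s["protein"] > 0,
          lambda s: s.update(protein=s["protein"] - 1), delay=1.0),
]

if __name__ == "__main__":
    for time, name in simulate({"on": 0, "protein": 0}, EVENTS, t_end=20.0):
        print(f"{time:5.1f}  {name}")
```

A hierarchical model would compose several such nodes and let them exchange values; that composition, and the coupling with continuous nodes, is deliberately left out of this sketch.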
The BioRica compiler reads a specification for a hierarchical model and compiles it into an executable simulator. The modeling language is a stochastic extension to the AltaRica
The Génolevures online database provides tools and data for exploring the annotated genome sequences of more than 20 genomes, determined and manually annotated by the Génolevures Consortium to facilitate comparative genomic studies of hemiascomycetous yeasts. Data are presented with a focus on relations between genes and genomes: conservation of genes and gene families, speciation, chromosomal reorganization and synteny. The Génolevures site includes a private collaboration area for specific studies by members of its international community.
The contents of the knowledge base are expanded and maintained by the CNRS through GDR 2354 Génolevures, and full data may be downloaded from the site.
Génolevures online uses our open-source MAGUS system for genome navigation, with project-specific extensions developed by David Sherman, Pascal Durrens, and Tiphaine Martin; these extensions are not made available due to uncertainty about intellectual property rights.
For more information see the Génolevures web site.
Inria Bioscience Resources is a portal designed to improve the visibility of bioinformatics tools and resources developed by Inria teams. This portal will help the community of biologists and bioinformaticians understand the variety of bioinformatics projects in Inria, test the different applications, and contact project-teams. Eight project-teams participate in the development of this portal. Inria Bioscience Resources is developed in an Inria Technology Development Action (ADT).
Analyses in comparative genomics are characteristically forms of data mining in high-dimensional sets of relations between genes and gene products. For every linear increase in genomic data, these relations can grow at worst geometrically.
Natalia Golenetskaya's thesis developed an integrated architecture that we call Tsvetok, which combines a novel NoSQL storage schema, domain-specific Map-Reduce algorithms, and existing resources to efficiently handle the fundamentally data-parallel analyses encountered in comparative genomics. Tsvetok components are deployed in Magnome's private cloud and have been extensively tested using data and use cases derived from log analyses of the Génolevures web resource. We designed Map-Reduce solutions for the principal whole-genome analyses used by Magnome for comparative genomics, in particular new distributed algorithms for systematic identification of gene fusion and fission events in eukaryote genomes, and large-scale consensus clustering for protein families. These examples illustrate two strategies that can be used to scale algorithms in a Map-Reduce setting.
Converting classical graph-based algorithms with message propagation: instead of traversing a graph, which would incur high latency, information is sent forward in waves, and synchronized later. Some of the intermediate computations may be redundant, but overall running time is minimized.
Iterative sampling strategies, which run the standard algorithm on carefully chosen subsets, and later compute a consensus of the intermediate results. The iterations may take some time to converge, but the individual instances can be run within one machine.
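The first strategy, message propagation in waves, can be sketched with a textbook example: finding connected components by minimum-label propagation. This is a plain single-process Python illustration of the map and reduce phases, not Tsvetok code and not the fusion/fission algorithm itself; the function names and the toy `GRAPH` are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Graph = Dict[str, Set[str]]   # undirected graph: every node is a key, edges are symmetric


def map_phase(labels: Dict[str, str], graph: Graph) -> Iterator[Tuple[str, str]]:
    """Each node sends its current label to itself and to every neighbor."""
    for node, label in labels.items():
        yield node, label
        for neighbor in graph[node]:
            yield neighbor, label


def reduce_phase(messages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Each node keeps the smallest label it received during this wave."""
    best: Dict[str, str] = {}
    for node, label in messages:
        if node not in best or label < best[node]:
            best[node] = label
    return best


def connected_components(graph: Graph) -> Dict[str, str]:
    """Repeat map/reduce waves until no label changes (synchronization point)."""
    labels = {node: node for node in graph}
    while True:
        new_labels = reduce_phase(map_phase(labels, graph))
        if new_labels == labels:
            return labels
        labels = new_labels


# Toy graph with two components: a-b-c and d-e.
GRAPH: Graph = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
}

if __name__ == "__main__":
    print(connected_components(GRAPH))
```

No node ever follows a path through the graph; labels simply spread one hop per wave. Many messages carry labels that are later discarded, which is the redundancy mentioned above, but each wave is embarrassingly parallel.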
Florian Lajus extended the Magus software platform to use the NoSQL storage components in Tsvetok, and has validated it on a large collection of fungal genomes.
Xavier Calcas is currently integrating the Galaxy platform
The Pantograph approach uses an annotated “scaffold” (reference) model and a collection of complementary predictions of homology between scaffold genes and target genes. The basis of the method is a weighing of the homology evidence to decide whether a reaction that is present in the scaffold ought to be present in the target.
We have improved on the method in two ways. First, we model the implicit knowledge represented in the boolean formula of each gene association, to derive hypotheses about the explicit role of individual genes; for example, a gene association
Second, we have adopted an abductive strategy for inferring reactions. In this strategy we consider that it is the reactions that explain the genes observed in the target genome. In the corresponding abductive logic program, the observations are the genes in the target, the integrity constraints are the rules that rewrite gene associations, and the hypotheses to be abduced are the reactions in the model. The scaffold model is compiled into a set of facts and predicates that express the reactions, their gene associations, and the integrity constraint rules; the abducibles generate assertions that specific reactions are in the target model. Combined with the facts of the genes observed in the target, this program generates, through abduction, the set of target reactions that explain the greatest number of genes.
The advantage of this approach is that it can invent, through specialization, reactions that are not present per se in the scaffold model.
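The idea that reactions are hypotheses explaining the genes observed in the target can be illustrated with a small Python stand-in. It is only a sketch under stated assumptions: it evaluates boolean gene associations directly rather than running an abductive logic program, it does not perform the specialization step, and the formula encoding, the function names, and the toy `SCAFFOLD` and `HOMOLOGS` data are all hypothetical.

```python
from typing import Dict, Set, Tuple, Union

# A gene association is a scaffold gene id, or a tuple ("and" | "or", sub, sub, ...).
Formula = Union[str, Tuple]


def explained_genes(formula: Formula, homologs: Dict[str, Set[str]]) -> Set[str]:
    """Target genes that support the formula; empty set if it is not satisfied."""
    if isinstance(formula, str):
        return set(homologs.get(formula, ()))
    op, *parts = formula
    supports = [explained_genes(part, homologs) for part in parts]
    if op == "and":
        # Protein complex: every subunit needs a homolog in the target.
        return set().union(*supports) if all(supports) else set()
    if op == "or":
        # Isozymes: any alternative with a homolog is enough.
        return set().union(*supports)
    raise ValueError(f"unknown operator: {op!r}")


def abduce(scaffold: Dict[str, Formula],
           homologs: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Hypothesize each scaffold reaction whose association is supported, with the genes it explains."""
    hypotheses = {}
    for reaction, association in scaffold.items():
        genes = explained_genes(association, homologs)
        if genes:
            hypotheses[reaction] = genes
    return hypotheses


# Toy data: scaffold genes sA..sE, target genes t1..t4.
SCAFFOLD: Dict[str, Formula] = {
    "R1": ("and", "sA", "sB"),   # complex of two subunits
    "R2": ("or", "sC", "sD"),    # two isozymes
    "R3": "sE",                  # single gene with no homolog in the target
    "R4": ("and", "sA", "sE"),   # complex with one missing subunit
}
HOMOLOGS: Dict[str, Set[str]] = {
    "sA": {"t1"}, "sB": {"t2"}, "sC": set(), "sD": {"t3", "t4"},
}

if __name__ == "__main__":
    for reaction, genes in abduce(SCAFFOLD, HOMOLOGS).items():
        print(reaction, sorted(genes))
```

In this toy case R1 and R2 are retained and together explain all four target genes, while R3 and R4 are rejected for lack of evidence.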
There is an inherent tension between detail and understandability in large metabolic networks: detailed description of individual reactions is needed for simulation, but high-level views of reactions are needed for describing pathways in human terms.
We defined knowledge-based methods that factor similar reactions into “generic” reactions in order to visualize a whole pathway or compartment, while maintaining the underlying model so that the user can later “drill down” to the specific reactions if need be.
This method is available as a Python library from http://
The accompanying figures illustrate model generation for Yarrowia lipolytica fatty acid oxidation in the peroxisome. Molecular species are represented as circular nodes, and the reactions as square ones, connected by edges to their reactants and products. Ubiquitous species (e.g. oxygen, water, ATP) are of smaller size and colored gray. Non-ubiquitous species are divided into fifteen equivalence classes and colored accordingly (red/blue for trivial species/reaction equivalence classes, different colors for non-trivial equivalence classes). The size of the model does not allow for readability of the species labels, thus we do not show them.
The generalization algorithm identifies equivalent molecular species using an ontology, and groups together reactions that operate on the same abstract species. It finds the greatest generalization that preserves stoichiometry. The generalized model represents quotient species and reactions. For example, the violet unsaturated FA-CoA node is a quotient of hexadec-2-enoyl-CoA, oleoyl-CoA, tetradecenoyl-CoA, trans-dec-2-enoyl-CoA, trans-dodec-2-enoyl-CoA, trans-hexacos-2-enoyl-CoA, trans-octadec-2-enoyl-CoA, and trans-tetradec-2-enoyl-CoA. In a similar manner, the light-green acCoA oxidase quotient reaction, which converts fatty acyl-CoA (yellow) into unsaturated FA-CoA (violet), generalizes six corresponding light-green reactions of the initial model.
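The grouping of reactions into quotient reactions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published library: the species-to-class mapping is given by hand instead of being derived from an ontology, the search for the greatest generalization is omitted, and the names `quotient_side`, `generalize`, `REACTIONS`, and `SPECIES_CLASS` are hypothetical. A reaction is kept on its own whenever the mapping would merge two of its species, since that would change its stoichiometry.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional, Tuple

Side = Dict[str, int]             # species -> stoichiometric coefficient
Reaction = Tuple[Side, Side]      # (reactants, products)


def quotient_side(side: Side, species_class: Dict[str, str]
                  ) -> Optional[FrozenSet[Tuple[str, int]]]:
    """Map one side of a reaction to classes; None if two species would collapse."""
    mapped: Dict[str, int] = {}
    for species, coefficient in side.items():
        cls = species_class.get(species, species)   # unmapped (ubiquitous) species stay as-is
        if cls in mapped:
            return None                             # merging would alter stoichiometry
        mapped[cls] = coefficient
    return frozenset(mapped.items())


def generalize(reactions: Dict[str, Reaction],
               species_class: Dict[str, str]) -> List[List[str]]:
    """Group reactions that become identical once species are replaced by their classes."""
    groups = defaultdict(list)
    for rid, (reactants, products) in reactions.items():
        key = (quotient_side(reactants, species_class),
               quotient_side(products, species_class))
        if None in key:
            key = (rid,)                            # keep this reaction ungeneralized
        groups[key].append(rid)
    return sorted(sorted(group) for group in groups.values())


# Toy fragment of peroxisomal beta-oxidation (illustrative, not the real model).
REACTIONS: Dict[str, Reaction] = {
    "ox_C16": ({"palmitoyl-CoA": 1, "O2": 1},
               {"hexadec-2-enoyl-CoA": 1, "H2O2": 1}),
    "ox_C18": ({"stearoyl-CoA": 1, "O2": 1},
               {"trans-octadec-2-enoyl-CoA": 1, "H2O2": 1}),
    "hydratase_C16": ({"hexadec-2-enoyl-CoA": 1, "H2O": 1},
                      {"3-hydroxyhexadecanoyl-CoA": 1}),
}
SPECIES_CLASS = {
    "palmitoyl-CoA": "fatty acyl-CoA",
    "stearoyl-CoA": "fatty acyl-CoA",
    "hexadec-2-enoyl-CoA": "unsaturated FA-CoA",
    "trans-octadec-2-enoyl-CoA": "unsaturated FA-CoA",
    "3-hydroxyhexadecanoyl-CoA": "hydroxy FA-CoA",
}

if __name__ == "__main__":
    print(generalize(REACTIONS, SPECIES_CLASS))
```

Here the two oxidase reactions collapse into one quotient reaction from fatty acyl-CoA to unsaturated FA-CoA, while the hydratase reaction remains in a group of its own.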
The generalized model describes
The specific model is appropriate for simulation, because it contains all of the precise reactions. The generalized model is suited for a human, because it reveals the main properties of the model and masks distracting details. For example, the generalized model highlights the fact that there is a particularity concerning C24:0-CoA (stearoyl-CoA) (red, inside the cycle): there exists a "shortcut" reaction (blue, inside the cycle), producing it directly from another fatty acyl-CoA (yellow), avoiding the usual four-reaction beta-oxidation chain, used for other fatty acyl-CoAs. This shortcut is not obvious in the specific model, because it is hidden among a plethora of similar-looking reactions.
In collaboration with Sven Saupe and Mathieu Paoletti from IBGC Bordeaux (ANR Mykimun), we worked on characterization of the STAND protein family in the fungal phylum. We established an in silico screen based on state-of-the-art bioinformatic tools, which, starting from experimentally studied sequences from Podospora anserina, allowed us to determine the first systematic picture of the fungal STAND protein repertoire (ms. in preparation). Most notably, we found evidence of extensive modularity of domain associations, and signs of concerted evolution within the recognition domain. Both results support the hypothesis that fungal STAND proteins, originally described in the context of vegetative incompatibility, are involved in a general fungal immune system. In addition, we investigated improved protein domain representations and elaborated a grammatical modelling method, which will be used to elucidate mechanisms of formation and operation of the STAND proteins.
We previously formalized two strategies for integrating discrete control with continuous models: coefficient switches, which control the parameters of the continuous model, and strong switches, which choose between different models. While these strategies have proved useful for modeling hybrid systems in biotechnology and medicine, the resulting system model can be inefficient when the different subsystems evolve at very different time scales. In order to improve the efficiency of the resulting simulations, we investigated the use of Kofman's Quantized State Systems (QSS), and demonstrated that the QSS approach can be adapted to BioRica. On the strength of this demonstration, we invited Joaquin Fernandez from Kofman's lab to Magnome. Joaquin had previously implemented an efficient library for QSS simulation, and during his stay succeeded in adapting it to our hybrid modeling framework. In his approach, SBML models with events are compiled into a hybrid model, using a variant of Modelica for surface syntax and using the QSS library for efficient simulation.
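For readers unfamiliar with QSS, the following minimal Python sketch shows the first-order method (QSS1) on a single scalar equation. It is not the QSS library mentioned above nor its BioRica adaptation; the function `qss1` and its parameters are assumptions made for illustration. Instead of stepping through time with a fixed increment, the integrator jumps directly to the next instant at which the state has moved by one quantum `dq`, so slowly varying variables generate few events.

```python
import math
from typing import Callable, List, Tuple


def qss1(f: Callable[[float], float], x0: float, dq: float,
         t_end: float) -> List[Tuple[float, float]]:
    """First-order quantized state integration of dx/dt = f(x) for a scalar x.

    The derivative is evaluated on the quantized state q, which is only
    updated when x has drifted by one quantum dq from it.
    """
    t, x = 0.0, x0
    q = x0                              # quantized state
    trajectory = [(t, x)]
    while True:
        slope = f(q)                    # constant between two quantization events
        if slope == 0.0:
            break                       # equilibrium: no further event
        dt = dq / abs(slope)            # time for x to move one quantum away from q
        if t + dt > t_end:
            break
        t += dt
        x += slope * dt                 # x is piecewise linear in time
        q = x                           # requantize at the event
        trajectory.append((t, x))
    return trajectory


if __name__ == "__main__":
    # Exponential decay dx/dt = -x, compared with the exact solution exp(-t).
    for t, x in qss1(lambda q: -q, x0=1.0, dq=0.1, t_end=1.5):
        print(f"t={t:.3f}  qss1={x:.3f}  exact={math.exp(-t):.3f}")
```

Note how the events become sparser as the decay slows down: the step size adapts to the dynamics, which is what makes the approach attractive when subsystems evolve at very different time scales.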
Using Magnome's Magus system and YAGA software, we have successfully realized a full annotation and analysis of several groups of related genomes:
Seven new genomes, provided to the Génolevures Consortium by the CEA–Génoscope (Évry), including two distant genomes from the Saccharomycetales, were annotated using previously published Génolevures genomes.
Twelve wine starter yeasts linked to fermentation efficiency.
Five pathogenic (to humans) and nonpathogenic Nakaseomycetes.
Two oleaginous strains with applications in biofuels.
Winemaking yeasts. In collaboration with partners in the ISVV, Bordeaux, we have assembled and analyzed 12 wine starter yeasts, with the goal of understanding genetic determinants of performance in wine fermentation. Analysis included identification of strain-specific gains and losses of genes linked both to niche specificity and to performance in industrial applications (article in prep.). A further combined analysis with 50 natural and industrial strains showed a pattern of introgression concentrated in industrial strains (article in prep.).
Oleaginous yeasts. In collaboration with Prof Jean-Marc Nicaud's lab at the INRA Grignon, we developed the first functional genome-scale metabolic model of Yarrowia lipolytica, an oleaginous yeast studied experimentally for its role as a food contaminant and its use in bioremediation and cell factory applications.
Using Magnome's Pantograph method, we produced an accurate functional model for Y. lipolytica, MODEL1111190000 in BioModels
Pathogenic yeasts. A further group of five species, comprising pathogenic and nonpathogenic species, was analyzed with the goal of identifying virulence determinants. By choosing species that are highly related but which differ in the particular traits that are targeted, in this case pathogenicity, we are able to focus on the few hundred genes related to the trait. The approximately 40,000 new genes from these studies were classified into existing Génolevures families as well as branch-specific families.
Magnome and the company BioLaffort are contracted to develop analyses and tools for rationalizing wine starter strain selection using genomics.
The “SAGESS” project, described below, has been partially funded by a grant to BioLaffort from the Region.
This project is a collaboration between the company BioLaffort, specialized in the selection of industrial yeasts with distinct technological abilities, with the ISVV and Magnome. The goal is to use genome analysis to identify molecular markers responsible for different physiological capabilities, as a tool for selecting yeasts and bacteria for wine fermentation through efficient hybridization and selection strategies. This collaboration has obtained the INNOVIN label.
Signal Transduction Associated with Numerous Domains (STAND) proteins play a central role in vegetative incompatibility (VI) in fungi. STAND proteins act as molecular switches, changing from closed inactive conformation to open active conformation upon binding of the proper ligand. Mykimun, coordinated by Mathieu Paoletti of the IBGC (Bordeaux), studies the postulated involvement of STAND proteins in heterospecific non self recognition (innate immune response).
In MYKIMUN we extend the notion of fungal immune receptors and immune reaction beyond the P. anserina NWD gene family. We develop in silico machine learning tools to identify new potential PRRs based on the expected characteristics of such genes, in P. anserina and beyond in additional sequenced fungal genomes. This should contribute to extending the concept of a fungal immune system to the whole fungal branch of the eukaryote phylogenetic tree.
A major objective of the “post-genome” era is to detect, quantify and characterise all relevant human proteins in tissues and fluids in health and disease. This effort requires a comprehensive, characterised and standardised collection of specific ligand binding reagents, including antibodies, the most widely used such reagents, as well as novel protein scaffolds and nucleic acid aptamers. Currently there is no pan-European platform to coordinate systematic development, resource management and quality control for these important reagents.
Magnome is an associate partner of the FP7 “Affinity Proteome” project coordinated by Prof. Mike Taussig of the Babraham Institute and Cambridge University. Within the consortium, we participate in defining community standards for data representation and exchange, and evaluate knowledge engineering tools for affinity proteomics data.
Prof. Mike Taussig: Babraham Institute & Cambridge University
Knowledge engineering for Affinity Proteomics
Henning Hermjakob: European Bioinformatics Institute
Standards and databases for molecular interactions
Magnome participates in the CARNAGE associated team, coordinated by AMIB, with the Russian Academy of Sciences.
AMAVI
Program: Inria International Partner
Title: Combinatorics and Algorithms for the Genomic sequences
Inria principal investigators: David Sherman
International Partner (Institution - Laboratory - Researcher):
Vavilov Institute of General Genetics (Russia (Russian Federation)) - Department of Computational Biology - Vsevolod Makeev
Duration: 2010 - present
The VIGG and AMIB teams have collaborated on sequence analysis for more than 12 years. The two groups aim at identifying DNA motifs for functional annotation, with a special focus on conserved regulatory regions. In the current 3-year project CARNAGE, our collaboration, which includes the Inria team Magnome, is oriented towards new trends that arise from Next Generation Sequencing data. Combinatorial issues in genome assembly are addressed. RNA structure and interactions are also studied.
The toolkit consists of pattern matching algorithms and analytic combinatorics, leading to common software.
Magnome collaborates with Rodrigo Assar of the Universidad Andrès Bello, and Nicolás Loira and Alessandro Maass of the Center for Genomic Regulation, in Santiago de Chile (Chile).
Magnome and the VIGG of the Russian Academy of Sciences (RAS) in Moscow are partners in a project funded by the CNRS and the RAS entitled “Séquençage profond de organismes biotechnologiques : des régulons à l'adaptation ”.
Vsevolod MAKEEV November 8-22 2013
Artëm KASIANOV November 8-22 2013
Joaquin FERNANDEZ September-November 2013
Pascal Durrens is:
leader of the “Comparative Genomics” theme and member of the Scientific Council of the LaBRI UMR 5800/CNRS.
responsible for scientific diffusion for the Génolevures Consortium.
member of the editorial board of the journal ISRN Computational Biology, and was a reviewer for the journal BMC Genomics.
expert in Genomics for the Fonds de la Recherche Scientifique-FNRS (FRS-FNRS), Belgium
David Sherman is:
president of the Commission de Jeunes Chercheurs, Inria Bordeaux Sud-Ouest
member for Bordeaux Sud-Ouest of Inria's Young Scientists Mission
member of the editorial board of the journal Computational and Mathematical Methods in Medicine
Licence : Anna Zhukova, J1MI2013 : Algorithmes et Programmes TD/TP, 30h, L2, Université Bordeaux, France
PhD in progress: Anna Zhukova, “Knowledge engineering for biological networks,” 2011–, Sherman
PhD in progress: Razanne Issa, “Analyse symbolique de données génomiques,” 2010–, Sherman
David Sherman was a member of the juries of:
Natalia GOLENETSKAYA, “Addressing scaling challenges in comparative genomics,” U. Bordeaux, 2013-09-09
Boyang JI, “Comparative and Functional Genome Analysis of Magnetotactic Bacteria,” U. Aix-Marseille, 2013-10-23
Andres ARAVENA, “Probabilistic and constraint based modelling to determine regulation events from heterogeneous biological data,” U. Rennes, 2013-12-13
Magnome participated in « UniThé ou Café » in the Inria Bordeaux – Sud-Ouest research center.
Anna Zhukova led one of the Inria workshops at the 2013 “Fête de la Science”.
David Sherman is a member of the Inria Bordeaux – Sud-Ouest's “Scientific Culture” committee, which organizes and proposes various scientific popularization actions.