Cells are seen as the basic structural, functional and biological units of all living systems. They represent the smallest units of life that can replicate independently, and are often referred to as the building blocks of life. Living organisms are classified as either unicellular, as is the case of most bacteria and archaea, or multicellular, as is the case of animals and plants. In fact, multicellular organisms such as humans may be seen as composed of native (human) cells, but also of extraneous cells represented by the diverse bacteria living inside the organism. The number of the latter relative to the number of native cells is believed to be high: in humans, for example, the proportion is believed to be 90%. Multicellular organisms have thus also been described as “superorganisms with an internal ecosystem of diverse symbiotic microbiota and parasites” (Nicholson et al., Nat Biotechnol, 22(10):1268-1274, 2004), where symbiotic means that the extraneous unicellular organisms (cells) live in a close, and in this case long-term, relation both with the multicellular organisms they inhabit and among themselves. On the other hand, bacteria sometimes group into colonies of genetically identical individuals which may acquire both the ability to adhere together and to become specialised for different tasks. An example of this is the cyanobacterium Anabaena sphaerica, which may group to form filaments of differentiated cells, some of them (the heterocysts) specialised for nitrogen fixation while the others are capable of photosynthesis. Such filaments have been seen as first examples of multicellular patterning.
At its extreme, one could then see life as one collection, or a collection of collections, of genetically identical or distinct self-replicating cells that interact, sometimes closely and for long periods of evolutionary time, with the same or distinct functional objectives. The interaction may be at equilibrium, meaning that it is beneficial or neutral to all, or it may be unstable, meaning that the interaction may be or become at some time beneficial only to some and detrimental to other cells or collections of cells. The interaction may involve living systems, or systems that have been described as being at the edge of life such as viruses, or else living systems and chemical compounds (environment). It also includes the interaction between cells within a multicellular organism, or between transposable elements and their host genome.
The application goal of ERABLE is, through the use of mathematical models and algorithms, to better understand such close and often persistent interactions, with a longer term objective of becoming able in some cases to suggest the means of controlling for or of re-establishing equilibrium in an interacting community by acting on its environment or on its players, how they play and who plays. This goal requires identifying the partners in a closely interacting community, who is interacting with whom, how and by which means. Any model is a simplification of reality, but once one is selected, the algorithms to explore such a model should address questions that are precisely defined and, whenever possible, be exact in the answer as well as exhaustive when more than one exists, in order to guarantee an accurate interpretation of the results within the given model. This fits well the mathematical and computational expertise of the team, and drives the methodological goal of ERABLE, which is to substantially and systematically contribute to the field of exact enumeration algorithms for problems that most often will be hard in terms of their complexity, and as such to also contribute to the field of combinatorics inasmuch as this may help in enlarging the scope of application of exact methods.
The key objective is, by constantly crossing ideas from different models and types of approaches, to look for and to infer “patterns”, as simple and general as possible, either at the level of the biological application or in terms of methodology. This objective drives which biological systems are considered, and also which models and in which order, going from simple discrete ones first on to more complex continuous models later if necessary and possible.
ERABLE has two main goals, one related to biology and the other to methodology (algorithms, combinatorics, statistics). In relation to biology, the main goal of ERABLE is to contribute, through the use of mathematical models and algorithms, to a better understanding of close and often persistent interactions between “collections of genetically identical or distinct self-replicating cells” which will correspond to organisms/species or to actual cells. The first will cover the case of what has been called symbiosis, meaning when the interaction involves different species, while the second will cover the case of different collections of cells. One question in particular, but not exclusively, interests us. This is the one of a (cancerous) tumour which may be seen as a collection of cells which suddenly disrupts its interaction with the other (collections of) cells in an organism by starting to grow uncontrollably.
Such interactions are being explored initially at the molecular level. Although we rely as much as possible on already available data, we intend to also continue contributing to the identification and analysis of the main genomic and systemic (regulatory, metabolic, signalling) elements involved or impacted by an interaction, and how they are impacted. We started going to the population and ecological levels by modelling and analysing the way such interactions influence, and are or can be influenced by the ecosystem of which the “collections of cells” are a part. The key steps are:
One important longer term goal of the above is to analyse how the behaviour and dynamics of such a network of networks might be controlled by modifying it, including by subtracting some of its components from the network or by adding new ones.
In relation to methodology, the main goal is to provide methods that enable us to address our main biological objective as stated above and that lead to the best possible interpretation of the results within a given pre-established model and a well defined question. Ideally, given such a model and question, the method is exact and also exhaustive if more than one answer is possible. Three aspects are thus involved here: establishing the model within which questions can and will be put; clearly defining such questions; answering them exactly or providing some guarantee on the proximity of the answer given to the “correct” one. We intend to continue contributing to these three aspects:
The goals of the team are biological and methodological, the two being intrinsically linked. Any division into axes along one or the other aspect or a combination of both is thus somewhat artificial. Following the evaluation of the team at the end of 2017, four main axes were identified, the last one being the most recently added. This axis is specifically oriented towards health in general. The first three axes are: (pan)genomics and transcriptomics in general, metabolism and (post)transcriptional regulation, and (co)evolution.
Notice that the division itself is based on the biological level (genomic, metabolic/regulatory, evolutionary) or main current Life Science purpose (health) rather than on the mathematical or computational methodology involved. Any choice has its part of arbitrariness. Through the one we made, we wished to emphasise the fact that the area of application of ERABLE is important for us. It does not mean that the mathematical and computational objectives are not equally important, but only that those are, most often, motivated by problems coming from or associated with the general Life Science goal. Notice that such arbitrariness also means that some Life Science topics will be artificially split into two different Axes. One example of this is genomics and the main health areas currently addressed, which are intrinsically inter-related.
Axis 1: (Pan)Genomics and transcriptomics in general
Intra- and inter-cellular interactions involve molecular elements whose identification is crucial to understand what governs such interactions, and also what might make it possible to control them. For the sake of clarity, the elements may be classified in two main classes, one corresponding to the elements that allow the interactions to happen by moving around or across the cells, and another corresponding to the genomic regions where contact is established. Examples of the first are non-coding RNAs, proteins, and mobile genetic elements such as (DNA) transposons, retro-transposons, insertion sequences, etc. Examples of the second are DNA/RNA/protein binding sites and targets. Furthermore, both types (effectors and targets) are subject to variation across individuals of a population, or even within a single (diploid) individual. Identification of these variations is yet another topic that we wish to cover. Variations are understood in the broad sense and cover single nucleotide polymorphisms (SNPs), copy-number variants (CNVs), repeats other than mobile elements, genomic rearrangements (deletions, duplications, insertions, inversions, translocations) and alternative splicing (AS) events. All three classes of identification problems (effectors, targets, variations) may be put under the general umbrella of genomic functional annotation.
Axis 2: Metabolism and (post)transcriptional regulation
As increasingly more data about the interaction of molecular elements (among which those described above) becomes available, these should then be modelled in a subsequent step in the form of networks. This raises two main classes of problems. The first is to accurately infer such networks. Assuming such a network, integrated or “simple”, has been inferred for a given organism or set of organisms, the second problem is then to develop the appropriate mathematical models and methods to extract further biological information from such networks.
The team has so far concentrated its efforts on two main aspects concerning such interactions: metabolism and post-transcriptional regulation by small RNAs. The more special niche we have been exploring in relation to metabolism concerns the fact that the latter may be seen as an organism's immediate window into its environment. Finely understanding how species communicate through those windows, or what impact they may have on each other through them, is thus important when the ultimate goal is to be able to model communities of organisms, for understanding them and possibly, in the longer term, for control. While such communication has been explored in a number of papers, most do so at too high a level or consider only pairs of interacting organisms, not larger communities. The idea of investigating consortia, and in the case of synthetic biology, of using them, has thus started being developed in the last decade only, and was motivated by the fact that such consortia may perform more complicated functions than single populations could, as well as be more robust to environmental fluctuations. Another originality of the work that the team has been doing in the last decade has also been to fully explore the combinatorial aspects of the structures used (graphs or directed hypergraphs) and of the associated algorithms. As concerns post-transcriptional regulation, the team has essentially been exploring the idea that small RNAs may have an important role in the dialog between different species.
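To make the directed hypergraph view of metabolism concrete, the following is a minimal sketch, not code from the team's tools. It assumes each reaction is a hyperarc that fires only when all of its substrates are available, and it computes the set of metabolites reachable from a set of seed compounds. The function name `scope` and the toy network are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# A reaction is modelled as a hyperarc: (set of substrates, set of products).
Reaction = Tuple[FrozenSet[str], FrozenSet[str]]


def scope(reactions: Dict[str, Reaction], seeds: Iterable[str]) -> Set[str]:
    """Return every metabolite producible from `seeds` by forward propagation."""
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            # Unlike an arc in an ordinary graph, a hyperarc needs ALL its tails.
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


# Hypothetical toy network, for illustration only.
toy = {
    "r1": (frozenset({"glucose"}), frozenset({"g6p"})),
    "r2": (frozenset({"g6p", "atp"}), frozenset({"fbp"})),
    "r3": (frozenset({"fbp"}), frozenset({"pyruvate"})),
}

print(sorted(scope(toy, {"glucose"})))
print(sorted(scope(toy, {"glucose", "atp"})))
```

With glucose alone, propagation stops at `g6p` because `r2` also needs `atp`. An ordinary directed graph cannot express this "all substrates required" condition, which is why hypergraphs are used.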
Axis 3: (Co)Evolution
Understanding how species that live in a close relationship with others may (co)evolve requires understanding for how long symbiotic relationships are maintained or how they change through time. This may have deep implications in some cases also for understanding how to control such relationships, which may be a way of controlling the impact of symbionts on the host, or the impact of the host on the symbionts and on the environment (by acting on its symbiotic partner(s)). These relationships, also called symbiotic associations, have however not yet been very widely studied, at least not at a large scale.
One of the problems is getting the data, meaning the trees for hosts and symbionts but even prior to that, determining with which symbionts the present-day hosts are associated (or are “infected” by as may be the term used in some contexts) which is a big enterprise in itself. The other problem is measuring the stability of the association. This has generally been done by concomitantly studying the phylogenies of hosts and symbionts, that is by doing what is called a cophylogeny analysis, which itself is often realised by performing what is called a reconciliation of two phylogenetic trees (in theory, it could be more than two but this is a problem that has not yet been addressed by the team), one for the symbionts and one for the hosts with which the symbionts are associated. This consists in mapping one of the trees (usually, the symbiont tree) to the other. Cophylogeny inherits all the difficulties of phylogeny, among which the fact that it is not possible to check the result against the “truth” as this is now lost in the past. Cophylogeny however also brings new problems of its own which are to estimate the frequency of the different types of events that could lead to discrepant evolutionary histories, and to estimate the duration of the associations such events may create.
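As an illustration of what a reconciliation mapping encodes, here is a minimal sketch that labels an internal symbiont node with an event type, given where the node and its two children are mapped in the host tree. It assumes a simplified three-event model (cospeciation, duplication, host switch) and ignores losses. The function names and the toy host tree are hypothetical and do not come from the team's software.

```python
from typing import Dict, List, Optional, Tuple


def ancestors(parent: Dict[str, Optional[str]], node: str) -> List[str]:
    """Return the path from `node` up to the root, `node` included."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lca(parent: Dict[str, Optional[str]], a: str, b: str) -> str:
    """Lowest common ancestor of two nodes of the same rooted tree."""
    up_a = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in up_a:
            return node
    raise ValueError("nodes are not in the same tree")


def classify_event(host_parent: Dict[str, Optional[str]],
                   mapping: Dict[str, str],
                   symbiont: str,
                   children: Tuple[str, str]) -> str:
    """Label the event at an internal symbiont node (simplified model)."""
    h = mapping[symbiont]
    h1, h2 = mapping[children[0]], mapping[children[1]]
    in_subtree = [h in ancestors(host_parent, x) for x in (h1, h2)]
    if not all(in_subtree):
        # One child lineage leaves the subtree of the host it came from.
        return "host switch"
    if h1 != h and h2 != h and lca(host_parent, h1, h2) == h:
        # The symbiont speciates together with its host.
        return "cospeciation"
    return "duplication"


# Hypothetical host tree: H0 -> (H1, C), H1 -> (A, B).
host_parent = {"H0": None, "H1": "H0", "C": "H0", "A": "H1", "B": "H1"}

print(classify_event(host_parent, {"s": "H1", "s1": "A", "s2": "B"}, "s", ("s1", "s2")))
print(classify_event(host_parent, {"s": "H1", "s1": "A", "s2": "A"}, "s", ("s1", "s2")))
print(classify_event(host_parent, {"s": "A", "s1": "A", "s2": "C"}, "s", ("s1", "s2")))
```

Counting how often each label occurs over a whole reconciliation gives the kind of event frequencies mentioned above. Real methods must in addition search over the many possible mappings and handle losses.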
Axis 4: Health in general
As indicated above, this is a recent axis in the team and concerns various applications to human and animal health. In some ways, it overlaps with the three previous axes as well as with Axis 5 on the methodological aspects, but since it has gained importance in the past few years, we decided to develop these particular applications further. Most of them started through collaborations with clinicians. Such applications are currently focused on three different topics: (i) Infectiology, (ii) Rare diseases, and (iii) Cancer.
Infectiology is the oldest one. It started with a collaboration with Arnaldo Zaha from the Federal University of Rio Grande do Sul in Brazil that focused on pathogenic bacteria living inside the respiratory tract of swine. Since our participation in the H2020 ITN MicroWine, we have also become interested in infections affecting plants, and more particularly vine plants. Rare Diseases on the other hand started with a collaboration with clinicians from the Centre de Recherche en Neurosciences of Lyon (CNRL) and is focused on the Taybi-Linder Syndrome (TALS) and on abnormal splicing of U12 introns, while Cancer rests on a collaboration with the Centre Léon Bérard (CLB) and Centre de Recherche en Cancérologie of Lyon (CRCL) which is focused on Breast and Prostate carcinomas and Gynaecological carcinosarcomas.
The latter collaboration was initiated through a relationship between a member of ERABLE (Alain Viari) and Dr. Gilles Thomas, who had been friends for many years. G. Thomas was one of the pioneers of Cancer Genomics in France. After his death in 2014, Alain Viari took the (part time) responsibility of his team at CLB and pursued the main projects he had started.
Within Inria and beyond, the first two applications (Infectiology and Rare Diseases) may be seen as unique because of their specific focus (resp. microbiome and respiratory tract of swine / vine plants on one hand, and TALS on the other). In the first case, such uniqueness is also related to the fact that the work done involves a strong computational part but also experiments that in some cases (respiratory tract of swine) are performed within ERABLE itself.
The main areas of application of ERABLE are: (1) biology understood in its more general sense, with a special focus on symbiosis and on intracellular interactions, and (2) health with a special emphasis for now on infectious diseases, rare diseases, and cancer.
There are three axes on which we would like to focus in the coming years.
Travelling is essential for the team, which is European and has many international collaborations. We would however like to continue to develop as much as possible travelling by train or even car. This is something we do already, for instance between Lyon and Amsterdam by train, and that we have done in the past, for instance between Lyon and Pisa by car, and between Rome and Lyon by train, or even in the latter case once between Rome and Amsterdam!
Computing is also essential for the team. We would like to continue our effort to produce resource frugal software and develop better guidelines for the end users of our software so that they know better under which conditions our software is expected to be adapted, and which more resource-frugal alternatives exist, if any.
Having an impact on how data are produced is also an interest of the team. Much of the data produced is currently only superficially analysed. Generating smaller datasets and promoting data reuse could avoid not only data waste, but also economise on computer time and energy required to produce such data.
As indicated earlier, the overall objective of the team is to arrive at a better understanding of close and often persistent interactions among living systems, between such living systems and viruses, between living systems and chemical compounds (environment), among cells within a multicellular organism, and between transposable elements and their host genome. There is another longer-term objective, much more difficult and riskier, a “dream” objective whose underlying motivation may be seen as social and is also environmental.
The main idea we thus wish to explore is inspired by the one universal concept underlying life. This is the concept of survival. Any living organism has indeed one single objective: to remain alive and reproduce. Not only that, any living organism is driven by the need to give its descendants the chance to perpetuate themselves. As such, no organism, and more in general, no species can be considered as “good” or “bad” in itself. Such concepts arise only from the fact that resources, some of which may be shared among different species, are of limited availability. Conflict thus seems inevitable, and “war” among species the only way towards survival.
However, this is not true in all cases. Conflict is often observed, even actively pursued by, for instance, humans. Two striking examples that have been attracting attention lately, not necessarily in a way that is positive for us, are related to the use of antibiotics on one hand, and insecticides on the other, both of which, especially but not only the second can also have disastrous environmental consequences. Yet cooperation, or at least the need to stop distinguishing between “good” (mutualistic) and “bad” (parasitic) interactions appears to be, and indeed in many circumstances is of crucial importance for survival. The two questions which we want to address are: (i) what happens to the organisms involved in “bad” interactions with others (for instance, their human hosts) when the current treatments are used, and (ii) can we find a non-violent or cooperative way to treat such diseases?
Put in this way, the question is infinitely vast. It is however not completely utopian. We had the opportunity in recent years to discuss this question, notably with biologists with whom we were involved in two European projects (namely BachBerry, http://
The aim will be to reach some proofs of concept, which may then inspire others, including ourselves in the longer term, to pursue research along this line of thought. Such proofs will in themselves already require a better understanding of what is involved in, and what drives or influences, any interaction.
The research of all team members, in particular of PhD students or Postdocs, is important for us and we prefer not to highlight any in particular.
We indicate in this section the new methods we developed in 2020 but also the older ones that continue to be used and that are being constantly maintained by the researchers of the team. This indeed represents a great part of our effort and time, and is important in general.
Improvements: The KissReads module has been modified and sped up, with a significant impact on run times. Parameters: the –timeout default is now 10000; on big datasets, recall can be increased at the cost of a slightly longer run time. Bugs fixed: (1) Reads containing only 'N': the graph construction stopped if the file contained a read composed only of 'N's. This was a silent bug, as no error message was produced. (2) Problems compiling with new versions of Mac OS X (10.8+): KisSplice now compiles with the new default C++ compiler of OS X 10.8+.
KisSplice was applied to a new application field, virology, through a collaboration with the group of Nadia Naffakh at Institut Pasteur. The goal is to understand how a virus (in this case influenza) manipulates the splicing of its host. This led to new developments in KisSplice. Taking into account the strandedness of the reads was required, in order not to mis-interpret transcriptional readthrough. We now use bcalm instead of dbg-v4 for the de Bruijn graph construction and this led to major improvements in memory and time requirements of the pipeline. We still cannot scale to very large datasets like in cancer, the time limiting step being the quantification of bubbles.
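For readers unfamiliar with the underlying data structure, the following minimal sketch builds a small de Bruijn graph from reads and reports the nodes where paths diverge, which is where bubbles open. It is only an illustration and is unrelated to the bcalm or KisSplice implementations. The function names are hypothetical, and skipping k-mers that contain 'N' is an assumption made here for simplicity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def de_bruijn(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some k-mer."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:
                # Assumption of this sketch: ambiguous k-mers are ignored.
                continue
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph: Dict[str, Set[str]]) -> Set[str]:
    """Nodes with more than one successor, i.e. candidate bubble openings."""
    return {node for node, successors in graph.items() if len(successors) > 1}


# Two toy reads that differ by a single nucleotide (C versus G).
reads = ["GATTCAGG", "GATTGAGG"]
graph = de_bruijn(reads, k=4)
print(branching_nodes(graph))
```

The two reads share the prefix path up to `ATT`, diverge into two branches, and meet again at `AGG`. This is the bubble pattern that a single-nucleotide difference produces in such a graph.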
No platforms for now due to a lack of human means to develop and especially to maintain one. An initial one is however actively planned for the coming years.
We present in this section the main results obtained in 2020.
We tried to organise these along the four axes as presented above. Clearly, in some cases, a result obtained overlaps more than one axis. In such case, we chose the one that could be seen as the main one concerned by such results.
On the other hand, we chose not to detail the results on more theoretical aspects of computer science when these are initially addressed in contexts not directly related to computational biology even though those on string 29, 1, 4, 36, 33, 20 and graph algorithms in general 5, 14, 34 are relevant for life sciences, such as for instance (pan)genome analysis, or could become more specifically so in a near future.
A few other results of 2020 are not mentioned in this report, not because the corresponding work is not important, but because it was likewise more specialised 35. In the same way, also for space reasons, we chose not to detail the results presented in some biological papers of the team when these did not require a mathematical or algorithmic input 6, 11, 12, 13, 18, 19, 21.
On the other hand, we do mention a work that was in revision and indeed accepted just at the end of 2020 but ended up being published at the very beginning of 2021.
Finally, we wish to call attention to the fact that some members of ERABLE, at CWI and at the University of Pisa, have been working on a theoretical problem which is important in relation to our main area of application. This problem indeed concerns privacy of the information that may be inferred by some of the algorithms developed, and more precisely what has been called string sanitization 8, 31, 32.
Alternative splicing and variant detection
In a paper published in BMC Bioinformatics 23, we introduced a new algorithm and the corresponding tool ebwt2InDel that extends our own previous framework (Prezza et al., Algorithms for Molecular Biology, 14(1): 1-13, 2019) to detect also INDELs, and implements recent findings that allow to perform the whole analysis using just a Burrows-Wheeler Transform, thus reducing the working space by one order of magnitude and enabling the analysis of full genomes. We also describe a simple strategy for effectively parallelising our tool for SNP detection only. On a 24-cores machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool is available at https://
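As background on the transform itself, here is a minimal sketch of the Burrows-Wheeler Transform of a single string, built naively from sorted rotations. It is not the ebwt2InDel implementation, which works on a collection of reads and uses far more space-efficient constructions. The function names and the `$` sentinel are choices made for this illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations (quadratic space)."""
    if sentinel in text:
        raise ValueError("the sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))                   # annb$aa
print(inverse_bwt(bwt("GATTACA")))     # GATTACA
```

The transform tends to group identical characters that precede similar contexts. BWT-based methods rely on this property to compare sequences without aligning them to a reference.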
Still on the issue of variants, we studied the biallelic variants in RNU4ATAC, a non-coding gene transcribed into the minor spliceosome component U4atac snRNA which are responsible for three rare recessive developmental diseases, namely Taybi-Linder/MOPD1, Roifman and Lowry-Wood syndromes. Next-generation sequencing of clinically heterogeneous cohorts (children with either a suspected genetic disorder or a congenital microcephaly) recently identified mutations in this gene, illustrating how profoundly these technologies are modifying genetic testing and assessment. As RNU4ATAC has a single non-coding exon, the bioinformatic prediction algorithms assessing the effect of sequence variants on splicing or protein function are irrelevant, which makes variant interpretation challenging to molecular diagnostic laboratories. In order to facilitate and improve clinical diagnostic assessment and genetic counselling, we presented i) an update of the previously reported RNU4ATAC mutations and an analysis of the genetic variations affecting this gene using the Genome Aggregation Database (gnomAD) resource; ii) the pathogenicity prediction performances of scores computed based on an RNA structure prediction tool and of those produced by the Combined Annotation Dependent Depletion tool for the 285 RNU4ATAC variants identified in patients or in large-scale sequencing projects; iii) a method, based on a cellular assay, that allows to measure the effect of RNU4ATAC variants on splicing efficiency of a minor (U12-type) reporter intron. Lastly, the concordance of the bioinformatic predictions and cellular assay results was investigated. This work was published in PLoS One 7.
Finally, again on the issue of alternative splicing events, we studied influenza A viruses (IAVs) which use diverse mechanisms to interfere with cellular gene expression. Although many RNA-seq studies had previously documented IAV-induced changes in host mRNA abundance, few were designed to allow an accurate quantification of changes in host mRNA splicing. Here, we showed that IAV infection of human lung cells induces widespread alterations of cellular splicing, with an overall increase in exon inclusion and decrease in intron retention. Over half of the mRNAs that show differential splicing undergo no significant changes in abundance or in their 3' end termination site, suggesting that IAVs can specifically manipulate cellular splicing. Among a randomly selected subset of 21 IAV-sensitive alternative splicing events, most are specific to IAV infection as they are not observed upon infection with VSV, induction of interferon expression or induction of an osmotic stress. Finally, the analysis of splicing changes in RED-depleted cells revealed a limited but significant overlap with the splicing changes in IAV-infected cells. This observation suggests that hijacking of RED by IAVs to promote splicing of the abundant viral NS1 mRNAs could partially divert RED from its target mRNAs. This work was published in NAR Genomics and Bioinformatics 3. All our RNA-seq datasets and analyses are made accessible for browsing through a user-friendly Shiny interface (http://
Bubble generator
Bubbles are pairs of internally vertex-disjoint paths that share the same source and the same target in a directed graph. A bubble generator was proposed in earlier work ([…] et al., Algorithmica, 82:898-914, 2019, appeared in 2020). Although this generator was quite effective in finding AS events, preliminary experiments showed that it is about 5 times slower than state-of-the-art methods. This year, we proposed a new family of bubble generators which improve substantially on the previous one. Indeed, the generators in this new family are about two orders of magnitude faster and are still able to achieve similar precision in identifying AS events. To highlight the practical value of our new generators, we also reported some experimental results on a real dataset. This work was presented at IWOCA 28.
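To fix ideas on what a bubble is, the sketch below enumerates by brute force all pairs of internally vertex-disjoint paths between a given source and target in a small directed graph. This is deliberately naive and is not the generator approach described above: it lists every simple path and therefore takes exponential time in general. The function names and the toy graph are hypothetical, and the sketch assumes that source and target are distinct and that the graph has no parallel arcs.

```python
from itertools import combinations
from typing import Dict, Iterator, List, Sequence, Tuple

Path = Tuple[str, ...]


def simple_paths(graph: Dict[str, Sequence[str]], source: str, target: str,
                 prefix: Path = ()) -> Iterator[Path]:
    """Yield every simple path from `source` to `target` (depth-first)."""
    path = prefix + (source,)
    if source == target:
        yield path
        return
    for successor in sorted(graph.get(source, ())):
        if successor not in path:
            yield from simple_paths(graph, successor, target, path)


def bubbles(graph: Dict[str, Sequence[str]], s: str, t: str) -> List[Tuple[Path, Path]]:
    """All pairs of (s,t)-paths sharing no vertex other than s and t."""
    paths = list(simple_paths(graph, s, t))
    return [(p, q) for p, q in combinations(paths, 2)
            if not set(p[1:-1]) & set(q[1:-1])]


# Toy graph with three (s,t)-paths, only two of which are internally disjoint.
toy = {"s": ["a", "b"], "a": ["t", "c"], "b": ["c"], "c": ["t"]}
for pair in bubbles(toy, "s", "t"):
    print(pair)
```

Even this tiny graph has three (s,t)-paths but a single bubble. On real sequencing graphs the number of bubbles can explode, which is the motivation for working with a small generating set rather than the full list.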
Genome assembly
Generating accurate genome assemblies of large, repeat-rich human genomes has proved difficult using only long, error-prone reads, and most human genomes assembled from long reads add accurate short reads to polish the consensus sequence. We developed a hybrid assembly method, which we called Wengan, that provides the highest quality at a low computational cost. We applied Wengan to perform a de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. Wengan implements efficient algorithms to improve assembly contiguity as well as consensus quality. The resulting genome assemblies thus have high contiguity (contig NG50:17.24-80.64 Mb), few assembly errors (contig NGA50:11.8-59.59 Mb), good consensus quality (QV:27.84-42.88), and high gene completeness (BUSCO complete: 94.6-95.2%), while consuming low computational resources (CPU hours:187-1,200). In particular, the assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50:59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50:57.88 Mb). The paper presenting Wengan was published in Nature Biotechnology 16. Wengan is available at
https://
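For readers less familiar with the contiguity statistic quoted above, the following minimal sketch computes a contig NG50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the estimated genome size. The function name and the toy lengths are hypothetical, and this is not code from Wengan.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Contig NG50 with respect to an estimated genome size.

    Returns 0 when the contigs cover less than half of the genome.
    """
    target = genome_size / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


lengths = [80, 70, 50, 40, 30, 20]  # toy contig lengths, arbitrary units
print(ng50(lengths, genome_size=300))            # 70
print(ng50(lengths, genome_size=sum(lengths)))   # the usual N50, also 70 here
```

Using the genome size rather than the total assembly length makes values comparable between assemblies of the same genome. NGA50 is computed in the same way after breaking contigs at misassemblies detected against a reference.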
On the same topic still, we also worked in the context of a haplotype-aware genome assembly whose aim is to reconstruct all individual haplotypes from a mixed sample and to provide the corresponding abundance estimates. We developed a reference-genome-independent solution based on the construction of a variation graph that captures all the diversity present in the sample. We solved the contig abundance estimation problem and proposed a greedy algorithm to efficiently build maximal-length haplotypes. We then obtained accurate frequency estimates for the reconstructed haplotypes using linear programming techniques. Our method outperforms the state-of-the-art approaches on viral quasispecies benchmarks and has the potential to assemble bacterial genomes in a strain-aware manner as well. This work was presented at RECOMB 30.
Binning
The human gut microbiota performs functions that are essential for the maintenance of the host physiology. However, characterising the functioning of microbial communities in relation to the host remains challenging in reference-based metagenomic analyses. Indeed, as taxonomic and functional analyses are performed independently, the link between the genes and the species remains unclear. Although a first set of species-level bins was built by clustering co-abundant genes, no reference bin set is established on the most used gut microbiota catalog, the Integrated Gene Catalog (IGC). With the aim to identify the best suitable method to group the IGC genes, we benchmarked nine taxonomy-independent binners implementing abundance-based, hybrid and integrative approaches. To this purpose, we designed a Simulated non-redundant Gene Catalog (SGC) and computed adapted assessment metrics. We showed that, overall, the best trade-off between the main metrics is reached by an integrative binner. For each approach, we then compared the results of the best-performing binner with our expected community structures and applied the method to IGC. We showed that the three approaches are distinguished by specific advantages, and by inherent or scalability limitations. We conclude from this that hybrid and integrative binners show promising and potentially complementary results but require improvements to be used on IGC to recover human gut microbial species. This work was submitted to NAR Genomics and Bioinformatics in 2020, and is now accepted. It will be published in early 2021. This is the work of the PhD student Marianne Borderes, co-supervised by M.-F. Sagot and S. Vinga (Instituto Superior Técnico, Lisbon), and funded by the ANR Technology 9.1
with the company MaatPharma, under the supervision of Lilia Boucinha initially, and then since 2019 of Emmanuel Prestat. This work with S. Vinga was part of the Inria Associated Team 10.1.1 which lasted from 2018 until this year (2020).
Metabolism
In a paper published in BMC Bioinformatics 2, we explored the concept of multi-objective optimisation in the field of metabolic engineering when both continuous and integer decision variables are involved in the model. In particular, we proposed a multi-objective model that may be used to suggest reaction deletions that maximise and/or minimise several functions simultaneously. The applications may include, among others, the concurrent maximisation of a bioproduct and of biomass, or maximisation of a bioproduct while minimising the formation of a given by-product, two common requirements in microbial metabolic engineering. Production of ethanol by the widely used cell factory Saccharomyces cerevisiae (Yeast) was adopted as a case study to demonstrate the usefulness of the proposed approach in identifying genetic manipulations that improve productivity and yield of this economically highly relevant bioproduct. We did an in vivo validation and we could show that some of the predicted deletions exhibit increased ethanol levels in comparison with the wild-type strain. The multi-objective programming framework we developed, called Momo, is open-source and uses Polyscip (available at http://). Momo itself is available at https://
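The notion of trade-off between several objectives can be illustrated by a minimal sketch that keeps only the non-dominated candidates among a set of already evaluated deletion strategies. This is not how Momo or Polyscip work internally, since they solve a mixed-integer multi-objective programme. The candidate names and scores below are invented purely for the illustration, and all objectives are assumed to be maximised.

```python
from typing import Dict, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates: Dict[str, Sequence[float]]) -> Dict[str, Sequence[float]]:
    """Keep the candidates that no other candidate dominates."""
    return {name: score for name, score in candidates.items()
            if not any(dominates(other, score) for other in candidates.values())}


# Hypothetical (biomass, ethanol) scores for a few deletion strategies.
strategies = {
    "wild type": (1.0, 0.2),
    "delete A": (0.8, 0.5),
    "delete B": (0.7, 0.4),   # worse than "delete A" on both objectives
    "delete A+C": (0.5, 0.9),
}
print(sorted(pareto_front(strategies)))
```

An objective that must be minimised, such as the formation of a by-product, can be handled by negating its value before the comparison.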
This work was then continued with a PhD student, Irene Ziska, funded by Inria and co-supervised by M.-F. Sagot and S. Vinga. It was also part of the project of the Inria Associated Team 10.1.1. In Irene’s work, again in collaboration with N. Mira, we have developed a new method that, like Momo, identifies potential metabolic engineering strategies such as gene and reaction knock-outs, but that also explicitly takes into account that, in some cases, the target chemical can be toxic for the microorganism itself, which might render the production unstable. This new method thus aims to identify knock-outs which increase the production of the target and which, at the same time, ensure that the microorganism keeps a high resistance against the toxic target. In a first step, our approach uses multi-objective linear optimisation to find valid trade-offs between growth, target production and toxicity resistance against the target. Afterwards, potential knock-outs are enumerated and then ranked to choose the best candidates for a desired trade-off. The toxicity resistance is measured by the activity of a set of critical reactions that have to be known or identified experimentally as a prerequisite. To test our method, we applied it to identify knock-outs for the production of ethanol in Yeast. A paper is being prepared for submission in early 2021.
Finally, still on metabolism, we also submitted an article that presents a novel computational method called Totoro (for "Transient respOnse to meTabOlic pertuRbation inferred at the whole netwOrk level"), which integrates internal metabolite concentrations that were measured before and after a perturbation into genome-scale metabolic reconstructions in order to predict the reactions that were active during the transient state that occurred after the perturbation. The proposed method is a constraint-based approach that takes the stoichiometry of the network into account. It minimises the change in concentrations for unmeasured metabolites and also the number of active reactions during the transient state, reflecting a parsimony assumption. An implementation in C++ is freely available at https://Totoro is able to handle full networks and to consider in the model stoichiometry, cycles, reversible reactions as well as co-factors. This work is also part of Irene Ziska’s PhD 38, and of the Inria Associated Team project 10.1.1.
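The two quantities that the Totoro paragraph mentions, namely the concentration change of unmeasured metabolites and the number of active reactions, can be evaluated for a given candidate flux vector using the standard mass balance (net change = S · v, with S the stoichiometric matrix). The sketch below only scores one candidate; it does not perform the optimisation and is not the Totoro implementation. The toy pathway and the function names are assumptions.

```python
def concentration_change(S, v):
    """Net change of each metabolite, S . v, where rows of S are
    metabolites and columns are reactions."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def score(S, v, unmeasured, eps=1e-9):
    """Return (total absolute change over the unmeasured metabolites,
    number of reactions carrying a non-zero flux)."""
    delta = concentration_change(S, v)
    change = sum(abs(delta[i]) for i in unmeasured)
    active = sum(1 for flux in v if abs(flux) > eps)
    return change, active


# Hypothetical linear pathway A -> B -> C with reactions r1: A->B and r2: B->C.
S_toy = [
    [-1, 0],   # A is consumed by r1
    [1, -1],   # B is produced by r1 and consumed by r2
    [0, 1],    # C is produced by r2
]
delta_toy = concentration_change(S_toy, [2.0, 2.0])
score_toy = score(S_toy, [2.0, 2.0], unmeasured=[1])
```

With equal fluxes through both reactions, the intermediate B is balanced, A decreases and C increases, so a candidate of this kind would incur no penalty on B if B were the only unmeasured metabolite.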
Post-transcriptional regulation
MicroRNAs (miRNAs) belong to a class of small non-coding RNAs (ncRNAs) of 18-24 nucleotides in part responsible for post-transcriptional gene regulation in eukaryotes. These evolutionarily conserved molecules influence fundamental biological processes, including cell proliferation, differentiation, apoptosis, immune response, and metabolism. Accurately identifying miRNAs has however proven difficult. In the last decade, with the increasing accessibility of high-throughput sequencing technologies, different methods have been developed to identify miRNAs, but most of them rely exclusively on pre-existing reference genomes. Despite all the advancements in the sequencing technologies and de novo assembly algorithms, few complete genomes are available today. This represents a recurrent problem for researchers working on non-model species. The lack of a high-quality reference genome thus reduces the possibilities for discovering novel miRNAs. In a paper currently under revision, we introduced BrumiR, which is a package composed of three tools: 1) a new miRNA discovery tool (BrumiR-core), 2) a specific genome mapper (BrumiR2reference), and 3) an sRNA-seq read simulator (miRsim). In particular, BrumiR-core is a de novo algorithm based on a de Bruijn graph approach that is able to identify miRNAs directly and exclusively from sRNA-seq data. Unlike other state-of-the-art tools, BrumiR does not rely on a reference genome, on the availability of close phylogenetic species, or on conserved sequence information. Instead, BrumiR starts from a de Bruijn graph encoding all the reads and is able to directly identify putative mature miRNAs on the generated graph. Along with miRNA discovery, BrumiR assembles and identifies other types of small and long non-coding RNAs expressed within the sequencing data.
Additionally, when a reference genome is available, BrumiR provides a new mapping tool (BrumiR2reference) that performs an exhaustive search to identify and validate the precursor sequences. We extensively benchmarked the BrumiR toolkit on animal and plant species using simulated and real datasets. The benchmark results show that BrumiR is very sensitive, is the fastest tool, and that its predictions are supported by the characteristic hairpin structure of miRNAs. Finally, we showed the power of BrumiR for discovering novel miRNAs in the model plant Arabidopsis thaliana. We sequenced a total of 18 sRNA-seq libraries from different stages of root development and used the BrumiR toolkit to analyze our data. We annotated three novel miRNAs involved in root development, showing in a real biological setting how BrumiR captures novel information even in the case of highly annotated genomes. The paper presenting BrumiR is currently in revision. Its preprint is available in BioRxiv https://
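For readers unfamiliar with de Bruijn graphs, the following minimal sketch shows how such a graph can be built from short reads and how an unambiguous path spells out a sequence. It is a textbook construction and not the BrumiR-core algorithm; the reads, the value of k and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers and every k-mer of a
    read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unitig_from(graph, start):
    """Follow edges from start while the path is unambiguous (one successor,
    one predecessor) and return the spelled sequence."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    seq, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in seen:
            break  # stop at a branching node or when a cycle closes
        seq += nxt[-1]
        seen.add(nxt)
        node = nxt
    return seq


# Three hypothetical overlapping reads sampled from one short molecule.
toy_reads = ["ACGTTGC", "TTGCAAG", "GCAAGC"]
toy_graph = de_bruijn(toy_reads, k=4)
toy_contig = unitig_from(toy_graph, "ACG")
```

Because the three reads overlap without ambiguity, walking the graph from the node `ACG` recovers the full underlying sequence without any reference genome, which is the property a de novo approach relies on.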
Cophylogeny
Phylogenetic tree reconciliation is the method of choice in analysing host-symbiont systems. Despite the many reconciliation tools that have been proposed in the literature, two main issues remain unresolved: (i) listing suboptimal solutions (i.e. whose score is "close" to the optimal ones) and (ii) listing only solutions that are biologically different "enough". The first issue arises because the optimal solutions are not always the ones biologically most significant; providing many suboptimal solutions as alternatives for the optimal ones is thus very useful. The second one is related to the difficulty of analysing a number of optimal solutions that is often exponential. We therefore proposed a method, called Capybara (for "equivalence ClAss enumeration of coPhylogenY event-BAsed ReconciliAtions"), that addresses both of these problems efficiently. Furthermore, Capybara includes a tool for visualising the solutions that significantly helps the user in the process of analysing the results. The source code, documentation, and binaries for all platforms are freely available at https://Bioinformatics 27.
The problem of efficiently enumerating equivalence classes, or one representative per class (without generating all the solutions), although identified as a need in many areas, has been addressed only for very few specific cases. In 2020, we started working on providing a general framework that solves this problem with polynomial delay in a wide variety of contexts, including optimisation ones that can be addressed by dynamic programming algorithms, and for certain types of equivalence relations between solutions. This theoretical work thus applies to a broad set of problems, including phylogenetic tree reconciliation, which initially motivated it. The work will be submitted in early 2021. It is already available in arXiv (https://
Theoretical aspects of cytoplasmic incompatibility
Another, more purely theoretical, work that was originally motivated by a question related in some sense to coevolution concerned cytoplasmic incompatibility, CI for short. CI relates to the manipulation by the parasite Wolbachia of its host reproduction. Despite its widespread occurrence, the molecular basis of CI remains unclear and theoretical models have been proposed to understand the phenomenon. In the work published in Algorithms for Molecular Biology 10, we considered the quantitative Lock-Key model, which currently represents a good hypothesis that is consistent with the data available. CI is in this case modelled as the problem of covering the edges of a bipartite graph with the minimum number of chain subgraphs. This problem was already known to be NP-hard, and we provided an exponential algorithm with a non-trivial complexity. Depending on the dataset, there may often be many optimal solutions which can be biologically quite different from one another. Relying on a single optimal solution may therefore be problematic. For this reason, we addressed the problem of enumerating (listing) all minimal chain subgraph covers of a bipartite graph and showed that it can be solved in quasi-polynomial time. Interestingly, in order to solve the above problems, we also considered the problem of enumerating all the maximal chain subgraphs of a bipartite graph and improved on the current results in the literature for the latter. Finally, to demonstrate the usefulness of our methods, we applied them to a real dataset.
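As a reminder of the underlying combinatorial object, a bipartite graph is a chain graph when the neighbourhoods of the vertices on one side are totally ordered by inclusion. The sketch below only tests this property for a given bipartite graph; it does not compute covers or enumerate chain subgraphs, and the representation and names are assumptions made for illustration.

```python
def is_chain_graph(neighbours):
    """neighbours maps each left vertex to the set of its right neighbours.
    The graph is a chain graph iff these sets are nested: after sorting by
    size, each set must be included in the next one."""
    ordered = sorted(neighbours.values(), key=len)
    return all(small <= large for small, large in zip(ordered, ordered[1:]))


# Hypothetical example with nested neighbourhoods: a chain graph.
nested = {"u1": {"v1"}, "u2": {"v1", "v2"}, "u3": {"v1", "v2", "v3"}}

# Two disjoint edges (u1-v1 and u2-v2): the neighbourhoods are incomparable.
two_disjoint_edges = {"u1": {"v1"}, "u2": {"v2"}}

nested_is_chain = is_chain_graph(nested)
disjoint_is_chain = is_chain_graph(two_disjoint_edges)
```

The second example, a pair of disjoint edges, is the smallest obstruction: covering its edges requires two chain subgraphs, which hints at why the number of chain subgraphs in a cover can be read as a measure of complexity.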
Human
The essential part of the work on human health was focused on cancer and rare diseases. In the second case, such work concerns variants of three rare recessive developmental diseases, namely Taybi-Linder/MOPD1, Roifman and Lowry-Wood syndromes, and influenza A viruses. These works are presented in the context of (pan)genomics and transcriptomics in general, and more precisely of 8.2.
As concerns cancer, and more precisely breast cancer, two works were published in 2020.
The first one examined cancer cell plasticity and malignant progression, both of which remain poorly understood. In the paper published in iScience 25, we showed that the uncharacterised epigenetic factor chromodomain on Y-like 2 (CDYL2) is commonly over-expressed in breast cancer, and that high CDYL2 levels correlate with poor prognosis. Supporting a functional role for CDYL2 in malignancy, it positively regulated breast cancer cell migration, invasion, stem-like phenotypes, and epithelial-to-mesenchymal transition. CDYL2 regulation of these plasticity-associated processes depended on signalling via p65/NF-kB and STAT3. This, in turn, was downstream of CDYL2 regulation of MIR124 gene transcription. CDYL2 co-immunoprecipitated with G9a/EHMT2 and GLP/EHMT1 and regulated the chromatin enrichment of G9a and EZH2 at MIR124 genes. We then proposed that CDYL2 contributes to poor prognosis in breast cancer by recruiting G9a and EZH2 to epigenetically repress MIR124 genes, thereby promoting NF-kB and STAT3 signalling, as well as downstream cancer cell plasticity and malignant progression.
The second work on breast cancer concerned discovering disease signatures or subtypes through gene expression data analysis. Although this shows great promise, it is also prone to technical variation whose removal is essential to avoid spurious discoveries. Because this variation is not always known and can be confounded with biological signals, its removal is however a challenging task. In the paper published in Communications Biology 17, we provided a step-wise procedure and comprehensive analysis of the MINDACT microarray dataset. The MINDACT trial enrolled 6693 breast cancer patients and prospectively validated the gene expression signature MammaPrint for outcome prediction. The study also yielded a full-transcriptome microarray for each tumor. We showed for the first time in such a large dataset how technical variation can be removed while retaining expected biological signals.
Animal
Mycoplasma hyopneumoniae is the most costly pathogen for swine production. Although several studies have focused on the host-bacterium association, little is known about the changes in gene expression of swine cells upon infection. To improve our understanding of this interaction, we infected swine epithelial NPTr cells with M. hyopneumoniae strain J to identify differentially expressed mRNAs and miRNAs. The levels of 1,268 genes and 170 miRNAs were significantly modified post-infection. Up-regulated mRNAs were enriched in genes related to redox homeostasis and antioxidant defense, known to be regulated by the transcription factor NRF2 in related species. Down-regulated mRNAs were enriched in genes associated with cytoskeleton and ciliary functions. Bioinformatic analyses suggested a correlation between changes in miRNA and mRNA levels, since we detected down-regulation of miRNAs predicted to target antioxidant genes and up-regulation of miRNAs targeting ciliary and cytoskeleton genes. Interestingly, most down-regulated miRNAs were detected in exosome-like vesicles suggesting that M. hyopneumoniae infection induced a modification of the composition of NPTr-released vesicles. Taken together, our data indicate that M. hyopneumoniae elicits an antioxidant response induced by NRF2 in infected cells. In addition, we propose that ciliostasis caused by this pathogen is partially explained by the down-regulation of ciliary genes. This work was published in Scientific Reports 22.
Others
Finally, in 2020 we also had a work that is related to health, although in a less direct way. Indeed, in a paper published in Journal of Operational Research 26, we considered the problem of scheduling patients in allocated surgery blocks in a Master Surgical Schedule. We paid attention to both the available surgery blocks and the bed occupancy in the hospital wards. More specifically, large probabilities of overtime in each surgery block are undesirable and costly, while large fluctuations in the number of used beds require extra buffer capacity and make the staff planning more challenging. The stochastic nature of surgery durations and length of stay on a ward hinders the use of classical techniques. Transforming the stochastic problem into a deterministic problem does not result in practically feasible solutions. In this paper we developed a technique to solve the stochastic scheduling problem, whose primary objective is to minimise variation in the necessary bed capacity, while maximising the number of patients operated on, minimising the maximum waiting time, and guaranteeing a small probability of overtime in surgery blocks. The method starts by solving an Integer Linear Programming (ILP) formulation of the problem, and then simulation and local search techniques are applied to guarantee small probabilities of overtime and to improve upon the ILP solution. Numerical experiments applied to a Dutch hospital showed promising results.
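The role of simulation in checking the overtime constraint can be sketched with a simple Monte Carlo estimate. This is not the method of the paper: the log-normal model for surgery durations, its parameters, the block lengths and the function name are all assumptions chosen for illustration.

```python
import random


def overtime_probability(surgeries, block_minutes, n_runs=5000, seed=0):
    """Estimate the probability that the total duration of the surgeries
    scheduled in one block exceeds the block length.
    surgeries: list of (mu, sigma) log-normal parameters, one per surgery."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    overtime = 0
    for _ in range(n_runs):
        total = sum(rng.lognormvariate(mu, sigma) for mu, sigma in surgeries)
        if total > block_minutes:
            overtime += 1
    return overtime / n_runs


# Hypothetical block with three surgeries (durations in minutes).
toy_block = [(4.5, 0.3), (4.0, 0.4), (4.8, 0.25)]
p_short = overtime_probability(toy_block, block_minutes=300)
p_long = overtime_probability(toy_block, block_minutes=480)
```

A schedule produced by a deterministic model could be screened in this way, with a local search moving surgeries between blocks whenever the estimated probability exceeds the tolerated level.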
Laurent Jacob has worked since 2019 with Pendulum Therapeutics (previously Whole Biome), with whom he signed a Non-Disclosure Agreement and through whom he collaborates with Hector Roux de Bezieux, a PhD student in biostatistics at the University of California, Berkeley, USA, who is also a computational biologist at the company.
Chile
Besides the collaboration with Elena Vidal from the Universidad Mayor, Santiago, mentioned below, we have informal collaborations with Rodrigo Gutiérrez from the Universidad Catolica of Santiago, who was co-supervisor of Carol Moraga Quinteros with M.-F. Sagot, as well as with Vicente Acuña, who is a scientist at the Centro de Modelamiento Matemático (CMM) in Santiago.
Brazil
Ahimsa
Fapesp-UdL
Chile
ERABLE participates in the project Network for Organismal Interactions Research (NOIR) funded by Conicyt in Chile within the call International Networking between Research Centers. The project started in 2019 and will last until the end of 2020. The coordinator on the Chilean side is Elena Vidal from the Universidad Mayor, Santiago, Chile, and the Erable participants are Carol Moraga Quinteros, Mariana Ferrarini and Marie-France Sagot.
Due to Covid-19, a number of planned visits, notably from Brazil, Chile and Portugal, had to be cancelled. The only one that took place, in the first two months of 2020, concerned Nuno Mira, our collaborator from the Instituto Superior Técnico (IST), Lisbon, who was coming for a Sabbatical until the middle of 2020. Nuno managed to get back to his family in Lisbon just as the first confinement was going to start in mid-March. He has been negotiating with IST to postpone his Sabbatical, and thus his long visit to us, to 2021, or maybe even later depending on how the situation evolves.
Blerina Sinaimeri did a 12-month Sabbatical at Luiss University in Rome, Italy, where a member of ERABLE, Giuseppe Italiano, is Full Professor, thus interacting with him and also with another ERABLE team member, Alberto Marchetti-Spaccamela, who is Full Professor at Sapienza University in Rome. Blerina's stay actually started in July 2019, and will be extended until the end of January 2021. Starting from February 2021, Blerina will continue at Luiss University as an Associate Professor, with a 3-year Detachment from Inria. Blerina will continue to be a member of ERABLE, as are Giuseppe Italiano, Alberto Marchetti-Spaccamela and other researchers at the University of Pisa in Italy.
Two ERABLE members in the Netherlands, Solon Pissis and Leen Stougie, participate in an H2020 MSCA-RISE project (2020-2022) called Pangaia (Pan-genome Graph Algorithms and Data Integration) coordinated by Paola Bonizzoni, University of Milan, Italy.
Furthermore, in 2020, an H2020 Twinning project was accepted that is coordinated by INESC-ID, Instituto Superior Técnico, Lisbon, Portugal and of which ERABLE is a partner (the coordinator in France is Marie-France Sagot). The project focuses on Computational Biology, a strongly interdisciplinary area that combines Computer Science, Algorithms, Mathematics, Probability and Statistics, Machine Learning, Molecular Biology and Medicine. The project consortium is composed of INESC-ID (Coordinator), the National Institute for Research in Digital Science and Technology (Inria) through the Erable team, the Swiss Federal Institute of Technology (ETH Zürich) in Switzerland and the European Molecular Biology Laboratory (EMBL) in Germany. The main goal of the project is to intensify, increase and consolidate the research in Computational Biology carried out at INESC-ID in partnership with the European partner institutions.
Due to Covid-19, the start of this project was delayed until January 1st, 2021. It will last until the end of 2023, unless it is extended because some of the initiatives planned for the first year may not be feasible, once again because of Covid-19.
In the same way, another H2020 project, in this case an ITN with the acronym Alpaca that involves members of ERABLE, was accepted in 2020 but will start only in 2021. Two members of ERABLE will host a PhD student in their institutions, namely Solon Pissis at CWI and Nadia Pisanti at the University of Pisa. Other members of ERABLE will be involved in Alpaca.
By itself, ERABLE is built from what initially were collaborations with some major European Organisations (CWI, Sapienza University of Rome, Universities of Florence and Pisa, Free University of Amsterdam) and then became a European Inria Team.
Note that national projects of our members from Italy and the Netherlands are included here when these have no other partners than researchers from the same country.
Members of ERABLE have reviewed papers for a number of workshops and conferences including: CIAC, CPM, ISBRA, ISMB, LAGOS, MLCB, NeurIPS, RECOMB, WEPA, WABI.
Members of ERABLE have reviewed papers for a number of journals including:
Theoretical Computer Science, Algorithmica, Algorithms for Molecular Biology, Bioinformatics, BMC Bioinformatics, Genome Biology, Genome Research, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Machine Learning, Molecular Biology and Evolution, Nucleic Acids Research.
Hubert Charles was until this year director of the Biosciences Department of the Insa-Lyon and co-director of studies of the “Bioinformatique et Modélisation (BIM)” track.
Roberto Grossi is member of the International Olympiad in Informatics.
Giuseppe Italiano is Vice-President of the European Association for Theoretical Computer Science (EATCS) and Director of the Master of Science in Data Science and Management, LUISS University, Rome, besides having a number of other responsibilities at LUISS. He is also member of the Advisory Board of MADALGO - Center for MAssive Data ALGOrithmics, Aarhus, Denmark.
Laurent Jacob is alternate member of the “Conseil National des Universités” (CNU) 26 (“Applied Mathematics and Applications of Mathematics”). He is in the "Conseil Pédagogique et d'Orientation" and participated in the recruitment committee for an Associate Professor at Polytech Grenoble.
Nadia Pisanti is since November 1st 2017 member of the Board of the PhD School in Data Science (University of Pisa jointly with Scuola Normale Superiore Pisa, Scuola S. Anna Pisa, IMT Lucca).
Marie-France Sagot is member of the Advisory Board of CWI, Amsterdam, the Netherlands. In 2020, she was part of the ERC Consolidator Panel for LS2 and a member of the Review Committee for the Human Frontier Science Program.
Blerina Sinaimeri was member of the Inria National Commission for the recruitment of junior researchers, CRCN and ISFP, in 2020.
Leen Stougie is since April 2017 Leader of the Life Science Group at CWI. He is member of the General Board of the Dutch Network on the Mathematics of Operations Research (Landelijk Netwerk Mathematische Besliskunde (LNMB)), member of the Management Team of the Gravity project Networks, and member of the Gijs de Leve Award 2021 committee.
Alain Viari is member of a number of scientific advisory boards (IRT (Institut de Recherche Technologique) BioAster; Centre Léon Bérard). He also coordinates together with J.-F. Deleuze (CNRGH-Evry) the Research & Development part (CRefIX) of the “Plan France Médecine Génomique 2025”.
Fabrice Vavre is President of Section 29 of the CoNRS.
Cristina Vieira is member of the “Conseil National des Universités” (CNU) 67 (“Biologie des Populations et Écologie”), and since 2017 member of the “Conseil de la Faculté des Sciences et Technologies (FST)” of the University Lyon 1.
The members of ERABLE teach both at the Department of Biology of the University of Lyon (in particular within the BISM (BioInformatics, Statistics and Modelling) specialty) and at the Department of Bioinformatics of the Insa (National Institute of Applied Sciences).
Cristina Vieira is responsible for the Master Biodiversity, Ecology and Evolution (https://
The ERABLE team regularly welcomes M1 and M2 interns from the bioinformatics Master.
All French members of the ERABLE team are affiliated to the doctoral school E2M2 (Ecology-Evolution-Microbiology-Modelling, http://
Italian researchers teach between 90 and 140 hours per year, at both the undergraduate and at the Master levels. The teaching involves pure computer science courses (such as Programming foundations, Programming in C or in Java, Computing Models, Distributed Algorithms) and computational biology (such as Algorithms for Bioinformatics).
Dutch researchers teach between 60 and 100 hours per year, again at the undergraduate and Master levels, in applied mathematics (e.g. Operational Research, Advanced Linear Programming), machine learning (Deep Learning) and computational biology (e.g. Biological Network Analysis, Algorithms for Genomics).
The following PhDs were defended in ERABLE in 2020:
The following are the PhDs in progress:
The following are the PhD or HDR juries in which members of ERABLE participated in 2020.