ERABLE - 2017 - Annual activity report

Section: New Results

Identifying the molecular elements

Motif tries for pattern discovery. In [14], the motif trie data structure was introduced to improve the extraction of recurring patterns in sequences. The patterns extracted are maximal patterns with at most $k$ don't care symbols and at least $q$ occurrences, according to a given maximality notion. The paper applied the motif trie to this problem and showed how to build it efficiently. This led to the first algorithm that attains a stronger notion of output-sensitivity, where the cost for an input sequence of $n$ symbols is proportional to the actual number of occurrences of each pattern, which is at most $n$ (and much smaller in practice). This avoids the best-known cost of $O(n^c)$ per pattern, for a constant $c > 1$, which is otherwise impractical for massive sequences with a large value of $n$.
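To make the two thresholds concrete, here is a minimal Python sketch of what it means for a pattern with don't care symbols to occur in a sequence. It is a naive scan, not the motif trie of [14], and it ignores the maximality notion entirely. The wildcard character `*` and the function names are assumptions chosen for illustration.

```python
DONT_CARE = "*"  # illustrative wildcard symbol, not taken from [14]


def matches(pattern, text, start):
    """Return True if pattern matches text at position start."""
    return all(p == DONT_CARE or p == text[start + i]
               for i, p in enumerate(pattern))


def occurrences(pattern, text):
    """List every start position where pattern occurs in text (naive scan)."""
    last = len(text) - len(pattern)
    return [i for i in range(last + 1) if matches(pattern, text, i)]


def is_frequent(pattern, text, k, q):
    """Check both thresholds: at most k don't cares, at least q occurrences."""
    return (pattern.count(DONT_CARE) <= k
            and len(occurrences(pattern, text)) >= q)


text = "ACGTACCTAGGT"
print(occurrences("AC*T", text))            # positions of AC*T
print(is_frequent("A**T", text, k=2, q=3))  # two don't cares, three occurrences
```

Each call to `occurrences` rescans the whole sequence, so the work per pattern depends on $n$ rather than on the number of occurrences. That dependence is what the output-sensitive approach described above is designed to avoid.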

Identification of genome and alternative splicing variants in RNA-seq data. The team's work on identifying alternative splicing events and other genome variants such as SNPs (Single Nucleotide Polymorphisms) and indels started around 2010. This work has mostly used RNA-seq data, including for the genome variants investigated.

The analysis of both DNA and RNA-seq data obtained with so-called NGS (Next Generation Sequencing) has been an active domain of research for decades now, and many open questions remain despite such long and intense activity. One is the case of non-model organisms. Another major problem, that of repeats, has not been solved in any really satisfying way since the early days of genome sequencing. Notice however that repeats are not just “problems to be avoided”: they have a strong biological interest in themselves, notably those related to transposable elements. Various papers of the team in 2017, notably [13], [19], [22], [1], were concerned with the study of such elements.

As concerns non-model organisms, the team extended a method it had previously developed, called KisSplice, to identify, quantify and annotate SNPs without any reference genome, using RNA-seq data only. The paper (Lopez-Maestre et al., Nucleic Acids Research, 44(19):e148, 2016) appeared at the end of 2016. There we showed that individuals can be pooled prior to sequencing if not enough material is available from one individual. Using pooled human RNA-seq data, we clarified the precision and the recall of our method and discussed them with respect to those of other methods that use a reference genome or an assembled transcriptome. We then experimentally validated the predictions of our method using RNA-seq data from two non-model species. The method can be used for any species to annotate SNPs and predict their impact on the protein sequence. It also makes it possible to test for the association of the identified SNPs with a phenotype of interest. One of the phenotypes explored was related to the dependence of the insect Asobara tabida on its endosymbiont Wolbachia.

The methodological part of the work above relied in part on a number of more theoretical results in algorithmics, focused more specifically on the problem of repeats [21]. The most theoretical recent work of the team, accepted at the 43rd International Workshop on Graph-Theoretic Concepts in Computer Science (WG) in 2017 [30], proposed the notion of a bubble generator set, i.e. a polynomial-sized subset of bubbles from which all the others can be obtained, also in polynomial time, through the application of a specific symmetric difference operator. This is further described in the last axis (Axis 6).
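As a small illustration of how one bubble can be obtained from two others, the Python sketch below combines two bubbles by symmetric difference. It assumes the usual definition of a bubble in a directed graph (two internally vertex-disjoint paths sharing the same source and target) and uses the plain set symmetric difference on edge sets. The operator defined in [30] is described only as "specific" and may be more constrained than this. The vertex names and the helper `is_bubble` are hypothetical.

```python
def is_bubble(edges):
    """Check that a set of directed edges forms two internally
    vertex-disjoint paths from one source s to one target t."""
    succ, indeg = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)
    sources = [x for x in succ if indeg[x] == 0 and len(succ[x]) == 2]
    targets = [x for x in succ if indeg[x] == 2 and len(succ[x]) == 0]
    if len(sources) != 1 or len(targets) != 1:
        return False
    s, t = sources[0], targets[0]
    # Every other vertex must sit in the middle of exactly one leg.
    if any(indeg[x] != 1 or len(succ[x]) != 1
           for x in succ if x not in (s, t)):
        return False
    # Walk both legs from s to t; together they must use every edge.
    used = 0
    for v in succ[s]:
        used += 1
        while v != t:
            v = succ[v][0]
            used += 1
    return used == len(edges)


# Two bubbles sharing the leg s -> b -> t (hypothetical example graph).
b1 = {("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")}
b2 = {("s", "b"), ("b", "t"), ("s", "c"), ("c", "t")}

combined = b1 ^ b2  # plain symmetric difference of the two edge sets
print(sorted(combined), is_bubble(combined))
```

Here the shared leg through `b` cancels out, leaving a third bubble whose legs go through `a` and `c`. In general the symmetric difference of two bubbles need not be a bubble, which is one reason a generator set needs a carefully specified operator.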

Genome and haplotype assembly. Fully assembling the genome sequence of an organism remains an important and challenging task. Genome scaffolding (i.e. the process of ordering and orienting contigs) of de novo assemblies usually represents the first step in most genome finishing pipelines. The team started by developing an algorithm (called MeDuSa) for this task (Bosi et al., Bioinformatics, 31(15):2443-2451, 2015). It exploited information obtained from a set of (draft or closed) genomes from related organisms to determine the correct order and orientation of the contigs. It formalised the scaffolding problem as a combinatorial optimisation problem on graphs and implemented an efficient constant-factor approximation algorithm to solve it. In contrast to the majority of scaffolders, it required neither prior knowledge of the input dataset (usually of micro-organisms) nor the availability of paired-end read libraries. MeDuSa however presented limitations both in the construction of the scaffolding graph for large genomes and in the subsequent assembly. The first aspect has recently been greatly improved by a method developed in collaboration with researchers from Chile, including Alex di Genova. This work led to the software Fast-SG, which is already publicly available, and to a first publication that is in revision.
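The idea of scaffolding as an optimisation problem on a graph can be sketched as follows. This is a simple greedy heuristic, not the constant-factor approximation algorithm of MeDuSa. Contigs are vertices, each weighted edge is a hypothetical score for placing two contigs next to each other, and the heaviest edges are kept as long as the result remains a set of disjoint paths. Contig orientation is ignored for brevity, and all names and weights are assumptions made for illustration.

```python
def greedy_scaffolds(contigs, weighted_edges):
    """Greedily keep heavy adjacencies so the kept edges form disjoint paths."""
    parent = {c: c for c in contigs}

    def find(x):
        # Union-find lookup with path halving, used to reject cycles.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    neighbours = {c: [] for c in contigs}
    for u, v, _ in sorted(weighted_edges, key=lambda e: -e[2]):
        # A contig has at most two neighbours in a path, and joining two
        # contigs already in the same path would close a cycle.
        if (len(neighbours[u]) < 2 and len(neighbours[v]) < 2
                and find(u) != find(v)):
            neighbours[u].append(v)
            neighbours[v].append(u)
            parent[find(u)] = find(v)

    # Read each path off starting from one of its endpoints.
    scaffolds, seen = [], set()
    for c in contigs:
        if c in seen or len(neighbours[c]) == 2:
            continue
        path, prev, cur = [], None, c
        while cur is not None:
            path.append(cur)
            seen.add(cur)
            nxt = [n for n in neighbours[cur] if n != prev]
            prev, cur = cur, (nxt[0] if nxt else None)
        scaffolds.append(path)
    return scaffolds


contigs = ["c1", "c2", "c3", "c4"]
edges = [("c1", "c2", 5), ("c2", "c3", 4), ("c1", "c3", 3), ("c3", "c4", 1)]
print(greedy_scaffolds(contigs, edges))
```

In this toy input, the edge between `c1` and `c3` is rejected because it would close a cycle, so the four contigs end up in a single scaffold. In a reference-guided setting, the weights would come from evidence in the related genomes. Building that graph is the step the text identifies as the bottleneck for large genomes.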
