Section: New Results
Approximate pattern matching
The problem of measuring the similarity between two strings arises in many areas of sequence analysis. A common metric is the Levenshtein distance, defined as the smallest number of substitutions, insertions, and deletions of symbols required to transform one string into the other. We have investigated the basic problem of the size of the neighborhood of a given pattern: counting how many strings lie within a bounded distance of a fixed reference string. No efficient algorithm for computing this quantity was previously known. We have proposed a dynamic programming algorithm that scales linearly with the size of the pattern. To this end, we have introduced a new variant of the universal Levenshtein automaton, which is of independent interest and may have many other applications in text algorithms [31].
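To make the counted quantity concrete, the following minimal sketch computes the Levenshtein distance with the textbook dynamic programme and then counts a neighborhood by exhaustive enumeration. It is not the linear-time algorithm of [31]: the enumeration is exponential in the pattern length and only serves as a reference for small inputs. The function names and the explicit `alphabet` parameter are assumptions made for illustration.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def neighborhood_size(pattern: str, k: int, alphabet: str) -> int:
    """Count the strings over `alphabet` at distance at most k from pattern.

    Brute-force reference only: every candidate string is enumerated.
    """
    total = 0
    # A string within distance k differs in length by at most k.
    shortest = max(0, len(pattern) - k)
    longest = len(pattern) + k
    for length in range(shortest, longest + 1):
        for letters in product(alphabet, repeat=length):
            if levenshtein(pattern, "".join(letters)) <= k:
                total += 1
    return total
```

For instance, over the two-letter alphabet `"AB"`, the neighborhood of `"A"` at distance 1 contains the empty string, `"A"`, `"B"`, `"AA"`, `"AB"` and `"BA"`, so `neighborhood_size("A", 1, "AB")` returns 6.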
We have also addressed the related problem of approximate pattern matching: given a text and a pattern, find all locations in the text that differ from the pattern by at most a given number of errors (in the sense of the Levenshtein distance). We have proposed a new kind of seeds (the 010 seeds) that combine exact parts with parts containing a fixed number of errors, and that are particularly well suited to short DNA motifs with a high error rate. We have demonstrated the applicability of these seeds in two main case studies: pattern matching on a genomic scale with a Burrows-Wheeler transform, and multi-pattern matching with indexing of the set of patterns [30].
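The problem statement above can be illustrated with a classical baseline: a dynamic programme in which an occurrence is allowed to start at any text position (often attributed to Sellers). This sketch is not the seed-based method of [30] and uses no index; it only shows what the expected output of approximate pattern matching looks like. The function name and the reporting of end positions are assumptions made for illustration.

```python
def approximate_matches(text: str, pattern: str, k: int) -> list[tuple[int, int]]:
    """Report (end, errors) for every text position where an occurrence ends.

    `end` is such that some substring text[start:end] is within Levenshtein
    distance `errors` <= k of pattern.
    """
    m = len(pattern)
    # col[i]: best distance between pattern[:i] and a substring of the text
    # ending at the current position.
    col = list(range(m + 1))
    hits = []
    for end, c in enumerate(text, start=1):
        new = [0]  # the empty pattern prefix matches anywhere at no cost
        for i in range(1, m + 1):
            new.append(min(
                col[i] + 1,                          # c is an extra text symbol
                new[i - 1] + 1,                      # pattern[i-1] is missing
                col[i - 1] + (pattern[i - 1] != c),  # match or substitute
            ))
        col = new
        if col[m] <= k:
            hits.append((end, col[m]))
    return hits
```

As a usage example, `approximate_matches("ACGTACGT", "CGT", 0)` returns `[(4, 0), (8, 0)]`, the two exact occurrences. The running time is proportional to the product of the text and pattern lengths, which is the cost that seed-and-index strategies aim to avoid at genomic scale.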