Section:
New Results
Approximate pattern matching
The problem of measuring the similarity between two strings arises in
many areas of sequence analysis. A common metric for it is the
Levenshtein distance.
This distance is defined as the smallest number of substitutions,
insertions, and deletions of symbols required to transform one
string into the other. We have investigated the basic problem of the
size of the neighborhood of a given pattern: counting how many
strings are within a bounded distance of a fixed reference string.
Until now, no efficient algorithm was known for computing it. We have
proposed a dynamic programming algorithm that scales linearly with the
size of the pattern. To this end, we have introduced a new variant of
the universal Levenshtein automaton, which is of interest in its own
right and may have many other applications in text algorithms [31].
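
To make the neighborhood problem concrete, here is a minimal brute-force sketch. It is not the linear-time algorithm of [31] and does not use the universal Levenshtein automaton: it simply enumerates candidate strings, so it is exponential and only usable on tiny inputs. The function names, the `alphabet` parameter, and the DNA default are assumptions made for illustration.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def neighborhood_size(pattern: str, k: int, alphabet: str = "ACGT") -> int:
    """Count the strings over `alphabet` within distance k of `pattern`.

    Brute force: a string at distance <= k has length in [m - k, m + k],
    so it is enough to enumerate those lengths (hypothetical helper).
    """
    m = len(pattern)
    count = 0
    for length in range(max(0, m - k), m + k + 1):
        for letters in product(alphabet, repeat=length):
            if levenshtein(pattern, "".join(letters)) <= k:
                count += 1
    return count
```

Such a reference implementation can serve as a check on small cases for any faster counting method.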
We have also addressed the related problem of approximate pattern
matching: given a text and a pattern, find all locations in the text
that differ from the pattern by at most a given number of errors (in
the sense of the Levenshtein distance). We have proposed a new kind
of seed (the 010 seeds) that combines exact parts with parts
containing a fixed number of errors, and that is particularly well
suited to short DNA motifs with a high error rate. We have
demonstrated the applicability of these seeds in two main case
studies: pattern matching on a genomic scale with a Burrows-Wheeler
transform, and multi-pattern matching with an index of the set of
patterns [30].
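
The approximate pattern matching problem itself can be illustrated with the textbook dynamic programming scan (often attributed to Sellers). This sketch is a baseline only: it does not implement the 010 seeds, the Burrows-Wheeler index, or the multi-pattern index of [30]. The function name and the convention of reporting end positions are assumptions made for illustration.

```python
def approximate_matches(text: str, pattern: str, k: int) -> list[int]:
    """Return end positions j (1-based prefix lengths of `text`) such that
    some substring of `text` ending at j is within Levenshtein distance k
    of `pattern`.

    Same recurrence as the edit-distance table, except that the top row is
    all zeros, so an occurrence may start anywhere in the text.
    """
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, ct in enumerate(text, 1):
        new = [0]  # matching the empty pattern prefix costs nothing
        for i, cp in enumerate(pattern, 1):
            new.append(min(
                col[i] + 1,               # extra text symbol ct
                new[i - 1] + 1,           # pattern symbol cp left unmatched
                col[i - 1] + (cp != ct),  # substitution, or match at no cost
            ))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

The scan costs time proportional to the product of the text and pattern lengths, which is the kind of cost that seed-based filtering is meant to avoid on genomic-scale texts.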