Copyright
Gad M Landau.

Efficient string matching with k mismatches online

. (page 1 of 1)
Online LibraryGad M LandauEfficient string matching with k mismatches → online text (page 1 of 1)
Font size
QR-code for this ebook


Computer Science Departmen



N






u




XI

— 1


tp


1


C rS^


«


•rH


E-i


M XI




-P -P


H


U) -H


U


s


in


+J


cu


C tJi


s


Q) C


O


•rH -rH



•rH u



TECHNICAL REPORT



EFFICIENT STRING MATCHING WITH k MISMATCHES



Gad M, Landau
Uzi Vishkin^



1



June 1985

Technical Report #167



NEW YORK UNIVERSITY




Department of Computer Science
Courant Institute of Mathematical Sciences

251 MERCER STREET, NEW YORK, N.Y. 10012



i^T-M p^t'



«■ ™ J>f4»^



wrw



DC C.



CO —
EC -f



o — ^ -•

>- v- oo



EFFICIENT STRING MATCHING WITH k MISMATCHES

Gad M. Landau^
Uzi Vishkin^



June 1985

Technical Report #167



1 De^ar-Tient of Comi:uter Science, School of Me.Lherr.a-.ica-: Sciences, :e.\v:v ..niversity
J- ^"4:^.^-. o: Com.u-.er Science, Courant institute of Hatnen.atica. Sciences, vfew 'r orx
U-.v;;s;^"Ind (^reseat address) Department of Coni.uter Science. Sc.v.o; o^ Matne^ticai
Sciences,' Tel Av:v University. The research of this autnor ^'.?.3 supported :y N..- g, .nt .n..
DCS-8318874 and ONR grant .VOOl 4-a5-K-0046.



lEFFICIENT STEING MATCHING WITH k MISMATCHES



Ckid M. Landau*^
Uzi Viskkin*^



ABSTRACT

Given a text of length n, a pattern of length m and an integer
k, we present an algorithm for finding all occurrences of the pat-
tern in the text, each with at most k mismatches. The algorithm
runs in 0{k[mlQgTn + n) time.

1. INTEODUCTION

The problem of string matching xuith k misTnatchss is defined as follows.
Suppose we are given a text of length n , a pattern of length m and an integer k .
Find all occurrences of the pattern in the text with at most k location in which
the text and the pattern have different symbols. Note that the case fc - is the
extensively studied string matching problem. Let us mention a few notable algo-
rithms for the string matching problem: linear time serial algorithms - i_BM],
[GS], [KMP], [KR] (a randomized algorithm) and [Y], parallel algorithms [G] and
[V]. The problem has a strong pragmatical flavor. In practice, we often need to
analize situations where the data is not completely reliable. SpecLfically, con-
sider a situation where the strings which are the input for our problem contain
errors and we still need to find cdl possible occurrences of the pattern in the
text as m reality. Assuming some bound on the number of errors would clearly
imply our problem.

We present an algorithm for string matching with k m.i3matches which runs in
time 0{k[mlQgm. + n)) on a random-access-machine (R^VI) [AHU].

After all the results m the present paper have been achieved, A. Slisenko
has brought to our attention the paper [I] in which another algorithm for the



1. Department of Computer Science, School of Matherr.aticai Sciences, Tei Av.v university

2. Departrr.ent of Computer Science, Courant Institute of Mathematicai Sciences, N'ew York
University and (present address) Department of Com.puter Science, School of Mathematical
Sciences, Tei Aviv University. The research of this author was sutiTiortad by XSF grant XS?-
DCa-8318874 and OX?, gran: >r0014-85-K-OO46.



-2-

same problem has been given. Ivanov claims that his algorithm runs in time
0{f {k){n+m)), where f {k) is a function ot k . f (k) is described by a combina-
tion of two intricate recursive inequalities. No additional hints regarding the
behaviour ol f {k) were found in his paper. We were unable to solve these mequli-
ties. However, we managed to show that f (k) is bounded from below by 2*^ for
every positive integer k. It might be that f {k) grows even subtantially faster
than 2*. His algorithm runs faster than ours only when k is very small and m
and n are almost of the same order of magnitude. In all other cases, our algo-
rithm is faster. An even more important advantage of our algorithm is that it is
simple and intuitive while Ivanov' s algorithm is very complicated. ( Its descrip-
tion needed over 40 journal pages).

II. ANALYSIS OF THE TEXT

Our algorithm has two parts. In the first part the pattern is analyzed. The
outcome of this analysis is used m the second part for analyzing the text. The
next section describes the first part. The present section is devoted to the
second part. We show how to use the results of the pattern analysis in order to
find all occurrences of the pattern m the text with at most k mismatches.

The input to the text analysis consists of the following:

a) The patteren. An array 4 = a^ . . . , a^.

b) The text. An array T = t ^, . . . , t^.

c) The output of the pattern analysis. A two dimentional array

PAT -MISMATCH [I ttl -1;1,...,2A: + l]. Where, row i of the array

{FAT -MIS MATCH {i.l) PAT -MISMATCH {i. 2k + 1)). contains the 2A: + 1

first locations in which cl^^.^, . . , , a^ has different sjnnbols than
ai a^_i . (PAT -MISMATCH [i .v) =/ means that a^+f ^ a.j and / is

the mismatch number v from left to right).

If there aire only c < 2,fc-i-l mismatches between c^ + i, . . . .a^ and
dj, , , . ,o.jp.-\ ^'ve enter the default value m +1 from location c -Hi on. That is,
PAT -MISMATCH [i.c +1) = PAT-MISMATCH[i .Zk +1) = tth-:.

The text is analyzed into the array TEXT -MISMATCH [Q n-m;: k + -].

roiiovving 1.1J.S i,eA.v. anaivsii, r^" i oi. Lij.g a.r. a.;/ i^iiivii — .m ^ .


1

Online LibraryGad M LandauEfficient string matching with k mismatches → online text (page 1 of 1)