Online Library → Gad M Landau → Efficient string matching with k mismatches → online text (page 1 of 1)

Font size

Computer Science Departmen

N

â€¢

u

XI

â€” 1

tp

1

C rS^

Â«

â€¢rH

E-i

M XI

-P -P

H

U) -H

U

s

in

+J

cu

C tJi

s

Q) C

O

â€¢rH -rH

â€¢rH u

TECHNICAL REPORT

EFFICIENT STRING MATCHING WITH k MISMATCHES

Gad M, Landau

Uzi Vishkin^

1

June 1985

Technical Report #167

NEW YORK UNIVERSITY

Department of Computer Science

Courant Institute of Mathematical Sciences

251 MERCER STREET, NEW YORK, N.Y. 10012

i^T-M p^t'

Â«â– â„¢ J>f4Â»^

wrw

DC C.

CO â€”

EC -f

o â€” ^ -â€¢

>- v- oo

EFFICIENT STRING MATCHING WITH k MISMATCHES

Gad M. Landau^

Uzi Vishkin^

June 1985

Technical Report #167

1 De^ar-Tient of Comi:uter Science, School of Me.Lherr.a-.ica-: Sciences, :e.\v:v ..niversity

J- ^"4:^.^-. o: Com.u-.er Science, Courant institute of Hatnen.atica. Sciences, vfew 'r orx

U-.v;;s;^"Ind (^reseat address) Department of Coni.uter Science. Sc.v.o; o^ Matne^ticai

Sciences,' Tel Av:v University. The research of this autnor ^'.?.3 supported :y N..- g, .nt .n..

DCS-8318874 and ONR grant .VOOl 4-a5-K-0046.

lEFFICIENT STEING MATCHING WITH k MISMATCHES

Ckid M. Landau*^

Uzi Viskkin*^

ABSTRACT

Given a text of length n, a pattern of length m and an integer

k, we present an algorithm for finding all occurrences of the pat-

tern in the text, each with at most k mismatches. The algorithm

runs in 0{k[mlQgTn + n) time.

1. INTEODUCTION

The problem of string matching xuith k misTnatchss is defined as follows.

Suppose we are given a text of length n , a pattern of length m and an integer k .

Find all occurrences of the pattern in the text with at most k location in which

the text and the pattern have different symbols. Note that the case fc - is the

extensively studied string matching problem. Let us mention a few notable algo-

rithms for the string matching problem: linear time serial algorithms - i_BM],

[GS], [KMP], [KR] (a randomized algorithm) and [Y], parallel algorithms [G] and

[V]. The problem has a strong pragmatical flavor. In practice, we often need to

analize situations where the data is not completely reliable. SpecLfically, con-

sider a situation where the strings which are the input for our problem contain

errors and we still need to find cdl possible occurrences of the pattern in the

text as m reality. Assuming some bound on the number of errors would clearly

imply our problem.

We present an algorithm for string matching with k m.i3matches which runs in

time 0{k[mlQgm. + n)) on a random-access-machine (R^VI) [AHU].

After all the results m the present paper have been achieved, A. Slisenko

has brought to our attention the paper [I] in which another algorithm for the

1. Department of Computer Science, School of Matherr.aticai Sciences, Tei Av.v university

2. Departrr.ent of Computer Science, Courant Institute of Mathematicai Sciences, N'ew York

University and (present address) Department of Com.puter Science, School of Mathematical

Sciences, Tei Aviv University. The research of this author was sutiTiortad by XSF grant XS?-

DCa-8318874 and OX?, gran: >r0014-85-K-OO46.

-2-

same problem has been given. Ivanov claims that his algorithm runs in time

0{f {k){n+m)), where f {k) is a function ot k . f (k) is described by a combina-

tion of two intricate recursive inequalities. No additional hints regarding the

behaviour ol f {k) were found in his paper. We were unable to solve these mequli-

ties. However, we managed to show that f (k) is bounded from below by 2*^ for

every positive integer k. It might be that f {k) grows even subtantially faster

than 2*. His algorithm runs faster than ours only when k is very small and m

and n are almost of the same order of magnitude. In all other cases, our algo-

rithm is faster. An even more important advantage of our algorithm is that it is

simple and intuitive while Ivanov' s algorithm is very complicated. ( Its descrip-

tion needed over 40 journal pages).

II. ANALYSIS OF THE TEXT

Our algorithm has two parts. In the first part the pattern is analyzed. The

outcome of this analysis is used m the second part for analyzing the text. The

next section describes the first part. The present section is devoted to the

second part. We show how to use the results of the pattern analysis in order to

find all occurrences of the pattern m the text with at most k mismatches.

The input to the text analysis consists of the following:

a) The patteren. An array 4 = a^ . . . , a^.

b) The text. An array T = t ^, . . . , t^.

c) The output of the pattern analysis. A two dimentional array

PAT -MISMATCH [I ttl -1;1,...,2A: + l]. Where, row i of the array

{FAT -MIS MATCH {i.l) PAT -MISMATCH {i. 2k + 1)). contains the 2A: + 1

first locations in which cl^^.^, . . , , a^ has different sjnnbols than

ai a^_i . (PAT -MISMATCH [i .v) =/ means that a^+f ^ a.j and / is

the mismatch number v from left to right).

If there aire only c < 2,fc-i-l mismatches between c^ + i, . . . .a^ and

dj, , , . ,o.jp.-\ ^'ve enter the default value m +1 from location c -Hi on. That is,

PAT -MISMATCH [i.c +1) = PAT-MISMATCH[i .Zk +1) = tth-:.

The text is analyzed into the array TEXT -MISMATCH [Q n-m;: k + -].

roiiovving 1.1J.S i,eA.v. anaivsii, r^" i oi. Lij.g a.r. a.;/ i^iiivii â€” .m ^ .

N

â€¢

u

XI

â€” 1

tp

1

C rS^

Â«

â€¢rH

E-i

M XI

-P -P

H

U) -H

U

s

in

+J

cu

C tJi

s

Q) C

O

â€¢rH -rH

â€¢rH u

TECHNICAL REPORT

EFFICIENT STRING MATCHING WITH k MISMATCHES

Gad M, Landau

Uzi Vishkin^

1

June 1985

Technical Report #167

NEW YORK UNIVERSITY

Department of Computer Science

Courant Institute of Mathematical Sciences

251 MERCER STREET, NEW YORK, N.Y. 10012

i^T-M p^t'

Â«â– â„¢ J>f4Â»^

wrw

DC C.

CO â€”

EC -f

o â€” ^ -â€¢

>- v- oo

EFFICIENT STRING MATCHING WITH k MISMATCHES

Gad M. Landau^

Uzi Vishkin^

June 1985

Technical Report #167

1 De^ar-Tient of Comi:uter Science, School of Me.Lherr.a-.ica-: Sciences, :e.\v:v ..niversity

J- ^"4:^.^-. o: Com.u-.er Science, Courant institute of Hatnen.atica. Sciences, vfew 'r orx

U-.v;;s;^"Ind (^reseat address) Department of Coni.uter Science. Sc.v.o; o^ Matne^ticai

Sciences,' Tel Av:v University. The research of this autnor ^'.?.3 supported :y N..- g, .nt .n..

DCS-8318874 and ONR grant .VOOl 4-a5-K-0046.

lEFFICIENT STEING MATCHING WITH k MISMATCHES

Ckid M. Landau*^

Uzi Viskkin*^

ABSTRACT

Given a text of length n, a pattern of length m and an integer

k, we present an algorithm for finding all occurrences of the pat-

tern in the text, each with at most k mismatches. The algorithm

runs in 0{k[mlQgTn + n) time.

1. INTEODUCTION

The problem of string matching xuith k misTnatchss is defined as follows.

Suppose we are given a text of length n , a pattern of length m and an integer k .

Find all occurrences of the pattern in the text with at most k location in which

the text and the pattern have different symbols. Note that the case fc - is the

extensively studied string matching problem. Let us mention a few notable algo-

rithms for the string matching problem: linear time serial algorithms - i_BM],

[GS], [KMP], [KR] (a randomized algorithm) and [Y], parallel algorithms [G] and

[V]. The problem has a strong pragmatical flavor. In practice, we often need to

analize situations where the data is not completely reliable. SpecLfically, con-

sider a situation where the strings which are the input for our problem contain

errors and we still need to find cdl possible occurrences of the pattern in the

text as m reality. Assuming some bound on the number of errors would clearly

imply our problem.

We present an algorithm for string matching with k m.i3matches which runs in

time 0{k[mlQgm. + n)) on a random-access-machine (R^VI) [AHU].

After all the results m the present paper have been achieved, A. Slisenko

has brought to our attention the paper [I] in which another algorithm for the

1. Department of Computer Science, School of Matherr.aticai Sciences, Tei Av.v university

2. Departrr.ent of Computer Science, Courant Institute of Mathematicai Sciences, N'ew York

University and (present address) Department of Com.puter Science, School of Mathematical

Sciences, Tei Aviv University. The research of this author was sutiTiortad by XSF grant XS?-

DCa-8318874 and OX?, gran: >r0014-85-K-OO46.

-2-

same problem has been given. Ivanov claims that his algorithm runs in time

0{f {k){n+m)), where f {k) is a function ot k . f (k) is described by a combina-

tion of two intricate recursive inequalities. No additional hints regarding the

behaviour ol f {k) were found in his paper. We were unable to solve these mequli-

ties. However, we managed to show that f (k) is bounded from below by 2*^ for

every positive integer k. It might be that f {k) grows even subtantially faster

than 2*. His algorithm runs faster than ours only when k is very small and m

and n are almost of the same order of magnitude. In all other cases, our algo-

rithm is faster. An even more important advantage of our algorithm is that it is

simple and intuitive while Ivanov' s algorithm is very complicated. ( Its descrip-

tion needed over 40 journal pages).

II. ANALYSIS OF THE TEXT

Our algorithm has two parts. In the first part the pattern is analyzed. The

outcome of this analysis is used m the second part for analyzing the text. The

next section describes the first part. The present section is devoted to the

second part. We show how to use the results of the pattern analysis in order to

find all occurrences of the pattern m the text with at most k mismatches.

The input to the text analysis consists of the following:

a) The patteren. An array 4 = a^ . . . , a^.

b) The text. An array T = t ^, . . . , t^.

c) The output of the pattern analysis. A two dimentional array

PAT -MISMATCH [I ttl -1;1,...,2A: + l]. Where, row i of the array

{FAT -MIS MATCH {i.l) PAT -MISMATCH {i. 2k + 1)). contains the 2A: + 1

first locations in which cl^^.^, . . , , a^ has different sjnnbols than

ai a^_i . (PAT -MISMATCH [i .v) =/ means that a^+f ^ a.j and / is

the mismatch number v from left to right).

If there aire only c < 2,fc-i-l mismatches between c^ + i, . . . .a^ and

dj, , , . ,o.jp.-\ ^'ve enter the default value m +1 from location c -Hi on. That is,

PAT -MISMATCH [i.c +1) = PAT-MISMATCH[i .Zk +1) = tth-:.

The text is analyzed into the array TEXT -MISMATCH [Q n-m;: k + -].

roiiovving 1.1J.S i,eA.v. anaivsii, r^" i oi. Lij.g a.r. a.;/ i^iiivii â€” .m ^ .

1

Online Library → Gad M Landau → Efficient string matching with k mismatches → online text (page 1 of 1)