Online Library → Gad M Landau → Efficient string matching with k differences → online text (page 1 of 2)

Font size

Computer Science Department

TECHNICAL REPORT

Landau, Gad M (-' '

Efficient string matching

with k differences.

EFFICIENT STRING MATCHING WITH k DIFFERENCES

BY

GAD M. LANDAU

UZI VISHKIN

OCTOBER 1985

TECHNICAL REPORT #186

NEW YORK UNIVERSITY

Department of Computer Science

Courant Institute of Mathematical Sciences

251 MERCER STREET, NEW YORK, NY. 10012

EFFICIENT STRING MATCHING WITH k DIFFERENCES

BY

GAD M. LANDAU

UZI VISHKIN

OCTOBER 1985

TECHNICAL REPORT #186

c

^

EFFICIENT STRING MATCHING WITH k DIFFERENCES

Cad M. Landau* 1

Department of Computer Science,

School of Mathematical Sciences,

Tel Aviv University,

Tel Aviv 69978, Israel.

Uzi Vishkin* 1 *

Department of Computer Science,

Courant Institute of Mathematical Sciences,

New York University

and (present address)

Department of Computer Science,

School of Mathematical Sciences,

Tel Aviv University.

ABSTRACT

Consider the string matching problem where differences between

characters of the pattern and characters of the text are allowed. Each

difference is due to either a mismatch between a character of the text and a

character of the pattern or a superfluous character in the text or a superfluous

character in the pattern. Given a text of length n, a pattern of length m and

an integer k, we present an algorithm for finding all occurrences of the

pattern in the text, each with at most k differences. It runs in 0(m + k?n)

time for alphabet whose size is fixed. For general input the algorithm

requires 0(mlogm + hn) time. In both cases the space requirement is O(m).

I. INTRODUCTION

In the known problem of pattern matching in strings (e.g., as discussed in [KMP]) we

are interested in finding all occurrences of a pattern in a text. In the present paper we

consider the following problem:

The string matching with k differences problem. (In short the k differences problem). Input.

Two strings: a pattern of length m and a text of length n and an integer IkaO, Find all

1) This research was supported by NSF grant NSF-DCR-8318874 and NSF-DCR-8413359, ONR grant

N0014-85-K-0046.

2) This research was supported by the Applied Mathematical Sciences subprogram of the Office of

Energy Research, U. S. Department of Energy, under contract number DE-AC02 76ER03077.

occurrences of the partem in the text with at most k differences.

Example. Let the text be abcdefghi , the pattern bxdyegh and it = 3. Let us see whether there

is an occurrence with *s k differences that starts at the second location of the text. For this

we propose the following correspondence between bcdefghi and bxdyegh. 1. b (of the text)

corresponds to b (of the pattern) 7. c to x. 3. d to d. 4. Nothing to y. 5. * to e. 6. / to

nothing. 7. g to g 8. .'; to h. The correspondence can be illustrated as follows,

b xdy e g h

bed efghi

In only three places the correspondence is between non-equal characters, implying that there

is an occurrence of the pattern at the second location of the text with 3 differences, as

required.

We distinguish three types of differences, (a) A character of the pattern corresponds to a

different character of the text. (Item 2 in the Example). In this case we say that there is a

mismatch between the two characters, (b) A character of the pattern corresponds to "no

character" in the text. (Item 4). (c) A character of the text corresponds to "no character" in

the pattern. (Item 6).

The problem of string matching with k mismatches, (in short the k mismatches problem),

was considered in [I] and [LV]. In the k mismatches problem we find all occurrences of the

pattern in the text with at most it differences of type (a). Differences of types (b) or (c) are

not allowed. [LV] give an 0(kmlog m + kn) algorithm for the problem. [I] gives a linear

time algorithm for fixed k. However, its running time grows very rapidly as function of it.

The k differences problem has a strong pragmatical flavor. In practice, we often need to

analyze situations where the data is not completely reliable. Specifically, consider a situation

where the strings which are the input for our problem contain errors and we still need to find

all possible occurrences of the pattern in the text as in reality. The errors may include a

character being replaced by another character, a character being omitted or a superfluous

character being inserted. Assuming some bound on the number of errors would clearly imply

our problems. Applications of our solution for the it differences problem in Molecular

Biology are discussed in [LVN]. [SK] give a comprehensive review of applications of the it

differences problem.

We give a new algorithm for the problem. The algorithm has two implementations.

The first runs in 0(m 2 + it 2 /!) time and 0{m 2 ) space for general input. The second runs in

0(m + ^n) time and O(m) space for alphabet whose size is fixed. For general input the

second implementation needs 0(mlogm + Jt^n) time and the same space. The algorithm is

designed for a random-access-machine (RAM) [AHU].

Our algorithm consists of a pattern analysis part to be followed by a text analysis.

In the first implementation we have achieved 0(m 2 ) time for the pattern analysis and O^n)

time for the text analysis.

In the second one we have actually achieved 0(m) time for the pattern analysis for alphabet

whose size is fixed and 0(mlogm) time for general input. While for the text analysis we

achieved O(lrn) time.

Besides its relative simplicity, our first implementation is not without merit even with respect

to the second implementation. Specifically, whenever the time for the text analysis

dominates the computation time, the running time of our first implementation is close to

optimal. There are a few realistic possibilities where this happens:

1. The same pattern has to be matched with different texts or the pattern is known in advance

and we have plenty of time to analyze it.

2. m is sufficiently small with respect to n . Specifically, m = o(kvn).

Perhaps surprisingly we were able to adopt *he conservative and simple framework of

[KMP] in the algorithm. That is, we first build 9 table based on analyst: of the pattern.

Then, we examine the text from left to right checking possible occurrences with respect to

one starting location (in the text) at each iteration. Besides the tables built in the pattern

analysis, the input to each iteration consists of the knowledge acquired in previous iterations.

The rightmost location in the text to which we arrived in a previous iteration is of particular

significance. Each iteration consists of manipulating this knowledge. And if necessary (till

this rightmost location we have not sufficient evidence to exclude the possibility of

occurrence), we proceed to investigate the text to the right of this rightmost location.

We use a further analogy to the [KMP] algorithm for the known string matching problem in

order to explain our contribution in the algorithm for the k differences problem. Consider

the following most trivial strings equality problem: Given two strings, find if they are

identical. Observing this problem and its immediate solution was clearly a step (even if it

was very minor) in devising string matching algorithms. Our presentation is strongly

motivated by this anajogy in the following sense. The presentation has two major steps.

In the first step we define an auxiliary problem which is analogous to the string equality

problem when the * differences problem is considered (instead of the string matching

problem). The algorithm for the auxiliary problem uses known techniques ([U]). Our

contribution is in providing the second major step. That is, we give an algorithm for the Jk

differences problem using the algorithm for the first step. The auxiliary problem of the first

major step is less obvious than the strings equality problem and provides an essential part of

our algorithm. However, we feel that our contribution, which is analogous to the whole

-4-

algorithm of [KMP], overcomes a more involved and general problem than in [KMP].

For the original string matching problem and the k mismatches problem, the definition

of the problems imply immediately algorithms which run in 0(jnn) time. [S] gives a simple

0(nm) time algorithm for the * differences problem. In his survey on future directions for

research in string matching, Z. Galil [G85] discusses the k mismatches problem, which seems

substantially easier than our present problem. He states that it is an open question whether

the naive algorithm for the k mismatches problem which takes 0{mn) time can be improved.

Note that the bound on the running time of our algorithm is much smaller than 0{nm). We

already noted that the case * = is the extensively studied string matching problem. There

are a few notable algorithms for the string matching problem: linear time serial algorithms -

[BM], [GS], [KMP], [KR] (a randomized algorithm) and [V], parallel algorithms - [G84] and

[V\. Note that none of these algorithms is suitable to cope with our problems.

[U85] presents an interesting algorithm for the k differences problem whose pattern

analysis takes exponential time. The algorithm runs in time 0(m\1\G + n) and requires

0((|S| + m)G) space, where |2| is the size of the alphabet and G = min(3 m , 2*|2|*m i+1 ).

Preprocessing of the pattern takes 0(m\2\G) time. Then analysis of the text takes 0(n) time

which is pretty impressive. However, the author himself seems to be aware that the space

and preprocessing time requirements make the algorithm impractical, in general. (For

comparison, the space requirement of our first implementation is 0(m 2 ) and of the second is

O(m)).

Section 2 gives an exceedingly simple algorithm for string matching with one different.

Section 3 presents the pattern analysis part of the algorithm for the * differences problem.

Section 4 presents the text analysis part. Both sections 3 and 4 discuss the first

implementation of the algorithm only. The second implementation is given in section 5.

II. PATTERN MATCHING WITH 1 DIFFERENCE IN LINEAR TIME

Below, we describe a simple algorithm for finding all occurrences of the pattern in the

text with at most one difference. [ML] showed how to apply the [KMP] algorithm for the

following slightly modified string matching problem. Find for each location of the text

whether an occurrence of the pattern starts at it. For each location in which there is no

occurrence of the pattern find the leftmost character in which there is a mismatch.

The algorithm for pattern matching with 1 difference has three steps.

1. Run this modified string matching algorithm on the input text and pattern. Suppose that no

occurrence of the pattern starts at character t, of the text. Denote by /(/') the leftmost

-5-

mismatched character of text for the pattern starting at t,. The modified string matching

algorithm would find /(Â»')â€¢

2. Inverse the text (if it was r,...,fâ€ž let it be *â€ž,*â€ž_; fj and the pattern and run the

same algorithm. Suppose that no occurrence of the pattern starts at location t, of the text.

Denote by r(i') the rightmost mismatched character of text for the pattern starting at t t . The

modified string matching algorithm would find r(i).

Recall the three types of differences as defined in the Introduction.

3. Suppose that no occurrence of the pattern starts at location f, of the text.

(i) For all i such that /(Â»') = r(i') , we can conclude that there is an occurrence of the pattern

starting at t t with one mismatch (difference of type (a)).

(ii) For all i such that /(Â«) > r(i-l), we can conclude that there is an occurrence of the

pattern starting at t, with at most one difference of type (b).

(iii) For all i such that /(/) & r(/ + l), we can conclude that there is an occurrence of the

pattern starting at r, with at most one difference of type (c).

Remarks. 1. The running time of this algorithm is 0(n + m). 2. We leave it to the

interested reader to find how to generalize this algorithm for a wider definition of this

problem where one "successive chunk of differences" is allowed.

III. STRING MATCHING WITH Jt DIFFERENCES - PATTERN ANALYSIS (ilrst

implementation).

The algorithm has two parts:

(a) The pattern analysis. We build a table which is based on analysis of the pattern.

(b) The text analysis. We show how to find efficiently all the occurrences of the pattern in

the text with at most k differences, using the result of the pattern analysis.

Analysis of the pattern

The input to the pattern analysis is the pattern, which is given as an array

A = a,, . . . ,a m . The output of the pattern analysis is a two dimensional array

MAX-LENGTH[0,...,m-lfl,...,m-l]. MAX - LENGTH (ij)=f means that

, +1 a, +/ = a y+ i Â«/+/, and a /+/+1 * Â«/+/+i- In words, consider laying the

a

suffix of the pattern starting at a l+l over the suffix of the pattern starting at ay +1 .

MAX - LENGTH (ij) is the longest match of prefixes between these two suffixes.

-6-

Define the pair (ij) ,0s ijsm-1 to be on diagonal d if i-j = d where possible

values of d are -(m-1) s d ss m-1. We compute each diagonal > . >w +i]i

where D l} is the number of differences between a l , . . . , a, and b x , . . . , b,.

It should be obvious that if D mJ s k, for at least one I, w-t s / s m Hit, then the answer to

the auxiliary problem is yes.

The following algorithm computes the matrix Dm^,p t/K+I A

Initialization D 0fi := 0.

for all I, 1 , := i ,.

for i:=l to m do

for l:= 1 to m + k do

D t y.= min (D,_ iJ +l, D /; _ : +l, D l . lJ . i if a, = b s or D,. it! ^^l otherwise).

(D S j is the minimum of three numbers. These numbers are obtained from the

predecessors of D tJ on its column, row and diagonal, rt-spectively).

This algorithm clearly runs in 0(m : ) time.

Q(km) time algorithm for the auxiliary problem. (Due to [U]).

Diagonal d of the matrix consists of all Dn's such thai lâ€”i = d.

Lemma 1 [U]. For every i,l, D,j - D^^-j, is either zero or one.

Lemma 1 implies that we can store the information of the matrix in a more compact way.

For a number of differences e and a diagonal d, let L djl denote the largest row i such that

D, , = e and D :; is on diagonal d. Note, that this implies that there are e differences

between a : a L ^ and fc,, . . . ,b L ^+^ ^d a L ^ +l 4- b L ^ +d+l .

Corollary. For our auxiliary problem we need only values of L d4 , where e and d satisfy

e â– Â£â– k and \d\ s e .

Proof, e s k is obvious. The initial values of the matrix and Lemma 1 imply that all the D t ,

on a diagonal d are a \d\ and therefore given a number of differences e we need only values

of L dt where \d\ s e.

The answer to the auxiliary problem is yes if one of the L d ., (\d\ S e as k), equals m.

Given d and e we describe how to compute L dt using its definition. That is , L dt is the

largest row such that D,j = e, and D i} is on the diagonal d. In the above 0(m 2 ) time

algorithm the assignment of e into D l} was done using one (or more) of the following data:

(a) D l - 1 j_ l (the predecessor of D tJ on the diagonal d) is e-1 and a, # b,.

(b) D;;-; (the predecessor of D tJ on row i which is also on the diagonal "below" ef) is eâ€” 1.

(c) Â£>/_!; (the predecessor of D ;; on column / which is also on the diagonal "above" ,.

This implies that we can start from D tJ and follow its predecessors on diagonal d by

-8-

possibility (d) till the first time one (or more) of possibilities (a) (b) and (c) occur.

The 0(km) time algorithm for the auxiliary problem is given below. The reader is invited to

convince oneself that the initialization step (Instruction 1) is done in a way which enables the

computation of the L d4 's in the subsequent instructions. Instructions 2-6 compute L dt 's

(|rf| Â£ e s k) by "inversing" the above description for computing the L d y* from their

definition. L dit -\, ^-i, e -iÂ» an ^ L d+lit .. : are used to initialize the variable row (Instruction

3), which is then increased by one at a time till it hits the correct value of L d , (Instruction 4).

The 0(km) time algorithm for the auxiliary problem

[1] Initialization

ford:=-(k+l)to(k+l)do

L d,\d\-2 : ~ -x ;

if d<

then i rfi |rf|_i:= \d\-l;

else L^itf* -1 ;

[2] for e:~0 iokdu

for d:=-e to e do

[3] row : - max[{L d , _,+ !), (L^-J, (I* +1 ,-i+l)].

[4] while a^+y = b mv+l+d do

row : = row + 1 .

[5]Â£ j), S[j and MAX-LENGTH do not help us

any more and we apply (the old) Instruction 4 as in the computation of the auxiliary

problem.

We finish this overview of iteration i by showing how to apply the sequence S^ and

MAX -LENGTH to obtain these jumps. The while loop of Instruction 4 looks for the longest

match between prefixes of some suffix of the text t t+row+d + u ... , where

j + 1 s i+row + d sj and some suffix of the pattern a^.^,... . We explain how to find

the maximum w such that a row+l a^^ equals t t+row+d+1 t l+raw+d + w . Suppose

that according to S,j the substring r /+TOH , +rf+1 f ;+roW+/ matches c c+1 , . . . ,a c+f for

some index c of the pattern and c +f is the maximal index of the pattern for which this match

holds. We can find those c and / using the fact that for each t h+1 (i s h < j) there exists a

triple (p^c^JJ in S tJ such that p 1 s*sp 1 + / 1 . ((p^c^J covers f A + 1 ). For the

- 10-

computation of w we need to break into the following cases:

Case(a). /2l, M AX- LENGTH (c, row) gives the maximal number g such that

a c+: a c + t equals a row+l Â«Â«*+,â– Case ( a ) nas ^o subcases.

Cose{a\). f*g. It is easy to see that here t t+m , +d+x , '/ +row+ rf +mtaVrf) =

flrow+i "nw+mh^i and t l+row+d + min(fig) + l * o w+mh(r>f)+1 . Therefore, we assign

w:= min(f,g).

Case(a2). f=g. This implies r f+rw+d+] t l+row+d + f = a rm+i c^^ but does not

reveal whether t l+row+a+ f +1 equals a njw + f+l or not. Therefore, we assign row:= row+f and

apply again the present case analysis accumulating this "jump" over / symbols into w .

Case (Â£Â»)./= 0. Case (b) has two subcases.

Case(b\). t l+rvw+d+1 * a rcw+i . Hence, we assign >v:= 0.

Case(b2). t !+royv+d+l = a, w+ i. Therefort, we assign row:= row + 1, and we apply again the

present case analysis accumulating this propagation of 1 into w.

-11-

The text analysis algorithm

;:=0;

for i: = to n-m+k do

begin

[1] Proper initialization (as in the 0(km) algorithm for the auxiliary problem)

[2] for e:=0 to k do

for d:=-e to e do

[3] row := mox[(L rf> ,_ 1 ,+ l), (L^^J, (I d+lf ,_i+l)].

[4. new] while i+row + d+1 s j do

[4.new.l] take from S tJ the triple that "covers" r /+rwv+rf+1 . Derive from

this triple the indices cj such that t t+mt+d+i ,f, +/w+4+/

= fl c+1 , . . . ,a c +f (f, +roH , +

TECHNICAL REPORT

Landau, Gad M (-' '

Efficient string matching

with k differences.

EFFICIENT STRING MATCHING WITH k DIFFERENCES

BY

GAD M. LANDAU

UZI VISHKIN

OCTOBER 1985

TECHNICAL REPORT #186

NEW YORK UNIVERSITY

Department of Computer Science

Courant Institute of Mathematical Sciences

251 MERCER STREET, NEW YORK, NY. 10012

EFFICIENT STRING MATCHING WITH k DIFFERENCES

BY

GAD M. LANDAU

UZI VISHKIN

OCTOBER 1985

TECHNICAL REPORT #186

c

^

EFFICIENT STRING MATCHING WITH k DIFFERENCES

Cad M. Landau* 1

Department of Computer Science,

School of Mathematical Sciences,

Tel Aviv University,

Tel Aviv 69978, Israel.

Uzi Vishkin* 1 *

Department of Computer Science,

Courant Institute of Mathematical Sciences,

New York University

and (present address)

Department of Computer Science,

School of Mathematical Sciences,

Tel Aviv University.

ABSTRACT

Consider the string matching problem where differences between

characters of the pattern and characters of the text are allowed. Each

difference is due to either a mismatch between a character of the text and a

character of the pattern or a superfluous character in the text or a superfluous

character in the pattern. Given a text of length n, a pattern of length m and

an integer k, we present an algorithm for finding all occurrences of the

pattern in the text, each with at most k differences. It runs in 0(m + k?n)

time for alphabet whose size is fixed. For general input the algorithm

requires 0(mlogm + hn) time. In both cases the space requirement is O(m).

I. INTRODUCTION

In the known problem of pattern matching in strings (e.g., as discussed in [KMP]) we

are interested in finding all occurrences of a pattern in a text. In the present paper we

consider the following problem:

The string matching with k differences problem. (In short the k differences problem). Input.

Two strings: a pattern of length m and a text of length n and an integer IkaO, Find all

1) This research was supported by NSF grant NSF-DCR-8318874 and NSF-DCR-8413359, ONR grant

N0014-85-K-0046.

2) This research was supported by the Applied Mathematical Sciences subprogram of the Office of

Energy Research, U. S. Department of Energy, under contract number DE-AC02 76ER03077.

occurrences of the partem in the text with at most k differences.

Example. Let the text be abcdefghi , the pattern bxdyegh and it = 3. Let us see whether there

is an occurrence with *s k differences that starts at the second location of the text. For this

we propose the following correspondence between bcdefghi and bxdyegh. 1. b (of the text)

corresponds to b (of the pattern) 7. c to x. 3. d to d. 4. Nothing to y. 5. * to e. 6. / to

nothing. 7. g to g 8. .'; to h. The correspondence can be illustrated as follows,

b xdy e g h

bed efghi

In only three places the correspondence is between non-equal characters, implying that there

is an occurrence of the pattern at the second location of the text with 3 differences, as

required.

We distinguish three types of differences, (a) A character of the pattern corresponds to a

different character of the text. (Item 2 in the Example). In this case we say that there is a

mismatch between the two characters, (b) A character of the pattern corresponds to "no

character" in the text. (Item 4). (c) A character of the text corresponds to "no character" in

the pattern. (Item 6).

The problem of string matching with k mismatches, (in short the k mismatches problem),

was considered in [I] and [LV]. In the k mismatches problem we find all occurrences of the

pattern in the text with at most it differences of type (a). Differences of types (b) or (c) are

not allowed. [LV] give an 0(kmlog m + kn) algorithm for the problem. [I] gives a linear

time algorithm for fixed k. However, its running time grows very rapidly as function of it.

The k differences problem has a strong pragmatical flavor. In practice, we often need to

analyze situations where the data is not completely reliable. Specifically, consider a situation

where the strings which are the input for our problem contain errors and we still need to find

all possible occurrences of the pattern in the text as in reality. The errors may include a

character being replaced by another character, a character being omitted or a superfluous

character being inserted. Assuming some bound on the number of errors would clearly imply

our problems. Applications of our solution for the it differences problem in Molecular

Biology are discussed in [LVN]. [SK] give a comprehensive review of applications of the it

differences problem.

We give a new algorithm for the problem. The algorithm has two implementations.

The first runs in 0(m 2 + it 2 /!) time and 0{m 2 ) space for general input. The second runs in

0(m + ^n) time and O(m) space for alphabet whose size is fixed. For general input the

second implementation needs 0(mlogm + Jt^n) time and the same space. The algorithm is

designed for a random-access-machine (RAM) [AHU].

Our algorithm consists of a pattern analysis part to be followed by a text analysis.

In the first implementation we have achieved 0(m 2 ) time for the pattern analysis and O^n)

time for the text analysis.

In the second one we have actually achieved 0(m) time for the pattern analysis for alphabet

whose size is fixed and 0(mlogm) time for general input. While for the text analysis we

achieved O(lrn) time.

Besides its relative simplicity, our first implementation is not without merit even with respect

to the second implementation. Specifically, whenever the time for the text analysis

dominates the computation time, the running time of our first implementation is close to

optimal. There are a few realistic possibilities where this happens:

1. The same pattern has to be matched with different texts or the pattern is known in advance

and we have plenty of time to analyze it.

2. m is sufficiently small with respect to n . Specifically, m = o(kvn).

Perhaps surprisingly we were able to adopt *he conservative and simple framework of

[KMP] in the algorithm. That is, we first build 9 table based on analyst: of the pattern.

Then, we examine the text from left to right checking possible occurrences with respect to

one starting location (in the text) at each iteration. Besides the tables built in the pattern

analysis, the input to each iteration consists of the knowledge acquired in previous iterations.

The rightmost location in the text to which we arrived in a previous iteration is of particular

significance. Each iteration consists of manipulating this knowledge. And if necessary (till

this rightmost location we have not sufficient evidence to exclude the possibility of

occurrence), we proceed to investigate the text to the right of this rightmost location.

We use a further analogy to the [KMP] algorithm for the known string matching problem in

order to explain our contribution in the algorithm for the k differences problem. Consider

the following most trivial strings equality problem: Given two strings, find if they are

identical. Observing this problem and its immediate solution was clearly a step (even if it

was very minor) in devising string matching algorithms. Our presentation is strongly

motivated by this anajogy in the following sense. The presentation has two major steps.

In the first step we define an auxiliary problem which is analogous to the string equality

problem when the * differences problem is considered (instead of the string matching

problem). The algorithm for the auxiliary problem uses known techniques ([U]). Our

contribution is in providing the second major step. That is, we give an algorithm for the Jk

differences problem using the algorithm for the first step. The auxiliary problem of the first

major step is less obvious than the strings equality problem and provides an essential part of

our algorithm. However, we feel that our contribution, which is analogous to the whole

-4-

algorithm of [KMP], overcomes a more involved and general problem than in [KMP].

For the original string matching problem and the k mismatches problem, the definition

of the problems imply immediately algorithms which run in 0(jnn) time. [S] gives a simple

0(nm) time algorithm for the * differences problem. In his survey on future directions for

research in string matching, Z. Galil [G85] discusses the k mismatches problem, which seems

substantially easier than our present problem. He states that it is an open question whether

the naive algorithm for the k mismatches problem which takes 0{mn) time can be improved.

Note that the bound on the running time of our algorithm is much smaller than 0{nm). We

already noted that the case * = is the extensively studied string matching problem. There

are a few notable algorithms for the string matching problem: linear time serial algorithms -

[BM], [GS], [KMP], [KR] (a randomized algorithm) and [V], parallel algorithms - [G84] and

[V\. Note that none of these algorithms is suitable to cope with our problems.

[U85] presents an interesting algorithm for the k differences problem whose pattern

analysis takes exponential time. The algorithm runs in time 0(m\1\G + n) and requires

0((|S| + m)G) space, where |2| is the size of the alphabet and G = min(3 m , 2*|2|*m i+1 ).

Preprocessing of the pattern takes 0(m\2\G) time. Then analysis of the text takes 0(n) time

which is pretty impressive. However, the author himself seems to be aware that the space

and preprocessing time requirements make the algorithm impractical, in general. (For

comparison, the space requirement of our first implementation is 0(m 2 ) and of the second is

O(m)).

Section 2 gives an exceedingly simple algorithm for string matching with one different.

Section 3 presents the pattern analysis part of the algorithm for the * differences problem.

Section 4 presents the text analysis part. Both sections 3 and 4 discuss the first

implementation of the algorithm only. The second implementation is given in section 5.

II. PATTERN MATCHING WITH 1 DIFFERENCE IN LINEAR TIME

Below, we describe a simple algorithm for finding all occurrences of the pattern in the

text with at most one difference. [ML] showed how to apply the [KMP] algorithm for the

following slightly modified string matching problem. Find for each location of the text

whether an occurrence of the pattern starts at it. For each location in which there is no

occurrence of the pattern find the leftmost character in which there is a mismatch.

The algorithm for pattern matching with 1 difference has three steps.

1. Run this modified string matching algorithm on the input text and pattern. Suppose that no

occurrence of the pattern starts at character t, of the text. Denote by /(/') the leftmost

-5-

mismatched character of text for the pattern starting at t,. The modified string matching

algorithm would find /(Â»')â€¢

2. Inverse the text (if it was r,...,fâ€ž let it be *â€ž,*â€ž_; fj and the pattern and run the

same algorithm. Suppose that no occurrence of the pattern starts at location t, of the text.

Denote by r(i') the rightmost mismatched character of text for the pattern starting at t t . The

modified string matching algorithm would find r(i).

Recall the three types of differences as defined in the Introduction.

3. Suppose that no occurrence of the pattern starts at location f, of the text.

(i) For all i such that /(Â»') = r(i') , we can conclude that there is an occurrence of the pattern

starting at t t with one mismatch (difference of type (a)).

(ii) For all i such that /(Â«) > r(i-l), we can conclude that there is an occurrence of the

pattern starting at t, with at most one difference of type (b).

(iii) For all i such that /(/) & r(/ + l), we can conclude that there is an occurrence of the

pattern starting at r, with at most one difference of type (c).

Remarks. 1. The running time of this algorithm is 0(n + m). 2. We leave it to the

interested reader to find how to generalize this algorithm for a wider definition of this

problem where one "successive chunk of differences" is allowed.

III. STRING MATCHING WITH Jt DIFFERENCES - PATTERN ANALYSIS (ilrst

implementation).

The algorithm has two parts:

(a) The pattern analysis. We build a table which is based on analysis of the pattern.

(b) The text analysis. We show how to find efficiently all the occurrences of the pattern in

the text with at most k differences, using the result of the pattern analysis.

Analysis of the pattern

The input to the pattern analysis is the pattern, which is given as an array

A = a,, . . . ,a m . The output of the pattern analysis is a two dimensional array

MAX-LENGTH[0,...,m-lfl,...,m-l]. MAX - LENGTH (ij)=f means that

, +1 a, +/ = a y+ i Â«/+/, and a /+/+1 * Â«/+/+i- In words, consider laying the

a

suffix of the pattern starting at a l+l over the suffix of the pattern starting at ay +1 .

MAX - LENGTH (ij) is the longest match of prefixes between these two suffixes.

-6-

Define the pair (ij) ,0s ijsm-1 to be on diagonal d if i-j = d where possible

values of d are -(m-1) s d ss m-1. We compute each diagonal > . >w +i]i

where D l} is the number of differences between a l , . . . , a, and b x , . . . , b,.

It should be obvious that if D mJ s k, for at least one I, w-t s / s m Hit, then the answer to

the auxiliary problem is yes.

The following algorithm computes the matrix Dm^,p t/K+I A

Initialization D 0fi := 0.

for all I, 1 , := i ,.

for i:=l to m do

for l:= 1 to m + k do

D t y.= min (D,_ iJ +l, D /; _ : +l, D l . lJ . i if a, = b s or D,. it! ^^l otherwise).

(D S j is the minimum of three numbers. These numbers are obtained from the

predecessors of D tJ on its column, row and diagonal, rt-spectively).

This algorithm clearly runs in 0(m : ) time.

Q(km) time algorithm for the auxiliary problem. (Due to [U]).

Diagonal d of the matrix consists of all Dn's such thai lâ€”i = d.

Lemma 1 [U]. For every i,l, D,j - D^^-j, is either zero or one.

Lemma 1 implies that we can store the information of the matrix in a more compact way.

For a number of differences e and a diagonal d, let L djl denote the largest row i such that

D, , = e and D :; is on diagonal d. Note, that this implies that there are e differences

between a : a L ^ and fc,, . . . ,b L ^+^ ^d a L ^ +l 4- b L ^ +d+l .

Corollary. For our auxiliary problem we need only values of L d4 , where e and d satisfy

e â– Â£â– k and \d\ s e .

Proof, e s k is obvious. The initial values of the matrix and Lemma 1 imply that all the D t ,

on a diagonal d are a \d\ and therefore given a number of differences e we need only values

of L dt where \d\ s e.

The answer to the auxiliary problem is yes if one of the L d ., (\d\ S e as k), equals m.

Given d and e we describe how to compute L dt using its definition. That is , L dt is the

largest row such that D,j = e, and D i} is on the diagonal d. In the above 0(m 2 ) time

algorithm the assignment of e into D l} was done using one (or more) of the following data:

(a) D l - 1 j_ l (the predecessor of D tJ on the diagonal d) is e-1 and a, # b,.

(b) D;;-; (the predecessor of D tJ on row i which is also on the diagonal "below" ef) is eâ€” 1.

(c) Â£>/_!; (the predecessor of D ;; on column / which is also on the diagonal "above" ,.

This implies that we can start from D tJ and follow its predecessors on diagonal d by

-8-

possibility (d) till the first time one (or more) of possibilities (a) (b) and (c) occur.

The 0(km) time algorithm for the auxiliary problem is given below. The reader is invited to

convince oneself that the initialization step (Instruction 1) is done in a way which enables the

computation of the L d4 's in the subsequent instructions. Instructions 2-6 compute L dt 's

(|rf| Â£ e s k) by "inversing" the above description for computing the L d y* from their

definition. L dit -\, ^-i, e -iÂ» an ^ L d+lit .. : are used to initialize the variable row (Instruction

3), which is then increased by one at a time till it hits the correct value of L d , (Instruction 4).

The 0(km) time algorithm for the auxiliary problem

[1] Initialization

ford:=-(k+l)to(k+l)do

L d,\d\-2 : ~ -x ;

if d<

then i rfi |rf|_i:= \d\-l;

else L^itf* -1 ;

[2] for e:~0 iokdu

for d:=-e to e do

[3] row : - max[{L d , _,+ !), (L^-J, (I* +1 ,-i+l)].

[4] while a^+y = b mv+l+d do

row : = row + 1 .

[5]Â£ j), S[j and MAX-LENGTH do not help us

any more and we apply (the old) Instruction 4 as in the computation of the auxiliary

problem.

We finish this overview of iteration i by showing how to apply the sequence S^ and

MAX -LENGTH to obtain these jumps. The while loop of Instruction 4 looks for the longest

match between prefixes of some suffix of the text t t+row+d + u ... , where

j + 1 s i+row + d sj and some suffix of the pattern a^.^,... . We explain how to find

the maximum w such that a row+l a^^ equals t t+row+d+1 t l+raw+d + w . Suppose

that according to S,j the substring r /+TOH , +rf+1 f ;+roW+/ matches c c+1 , . . . ,a c+f for

some index c of the pattern and c +f is the maximal index of the pattern for which this match

holds. We can find those c and / using the fact that for each t h+1 (i s h < j) there exists a

triple (p^c^JJ in S tJ such that p 1 s*sp 1 + / 1 . ((p^c^J covers f A + 1 ). For the

- 10-

computation of w we need to break into the following cases:

Case(a). /2l, M AX- LENGTH (c, row) gives the maximal number g such that

a c+: a c + t equals a row+l Â«Â«*+,â– Case ( a ) nas ^o subcases.

Cose{a\). f*g. It is easy to see that here t t+m , +d+x , '/ +row+ rf +mtaVrf) =

flrow+i "nw+mh^i and t l+row+d + min(fig) + l * o w+mh(r>f)+1 . Therefore, we assign

w:= min(f,g).

Case(a2). f=g. This implies r f+rw+d+] t l+row+d + f = a rm+i c^^ but does not

reveal whether t l+row+a+ f +1 equals a njw + f+l or not. Therefore, we assign row:= row+f and

apply again the present case analysis accumulating this "jump" over / symbols into w .

Case (Â£Â»)./= 0. Case (b) has two subcases.

Case(b\). t l+rvw+d+1 * a rcw+i . Hence, we assign >v:= 0.

Case(b2). t !+royv+d+l = a, w+ i. Therefort, we assign row:= row + 1, and we apply again the

present case analysis accumulating this propagation of 1 into w.

-11-

The text analysis algorithm

;:=0;

for i: = to n-m+k do

begin

[1] Proper initialization (as in the 0(km) algorithm for the auxiliary problem)

[2] for e:=0 to k do

for d:=-e to e do

[3] row := mox[(L rf> ,_ 1 ,+ l), (L^^J, (I d+lf ,_i+l)].

[4. new] while i+row + d+1 s j do

[4.new.l] take from S tJ the triple that "covers" r /+rwv+rf+1 . Derive from

this triple the indices cj such that t t+mt+d+i ,f, +/w+4+/

= fl c+1 , . . . ,a c +f (f, +roH , +

1 2

Online Library → Gad M Landau → Efficient string matching with k differences → online text (page 1 of 2)