# Efficient string matching with k differences online

. (page 1 of 2)
Online LibraryGad M LandauEfficient string matching with k differences → online text (page 1 of 2)
Font size Computer Science Department

TECHNICAL REPORT

Efficient string matching
with k differences.

EFFICIENT STRING MATCHING WITH k DIFFERENCES

BY

UZI VISHKIN

OCTOBER 1985
TECHNICAL REPORT #186

NEW YORK UNIVERSITY

Department of Computer Science
Courant Institute of Mathematical Sciences

251 MERCER STREET, NEW YORK, NY. 10012

EFFICIENT STRING MATCHING WITH k DIFFERENCES

BY

UZI VISHKIN

OCTOBER 1985
TECHNICAL REPORT #186

c

^

EFFICIENT STRING MATCHING WITH k DIFFERENCES

Department of Computer Science,

School of Mathematical Sciences,

Tel Aviv University,

Tel Aviv 69978, Israel.

Uzi Vishkin* 1 *

Department of Computer Science,
Courant Institute of Mathematical Sciences,

New York University

Department of Computer Science,
School of Mathematical Sciences,

Tel Aviv University.

ABSTRACT

Consider the string matching problem where differences between
characters of the pattern and characters of the text are allowed. Each
difference is due to either a mismatch between a character of the text and a
character of the pattern or a superfluous character in the text or a superfluous
character in the pattern. Given a text of length n, a pattern of length m and
an integer k, we present an algorithm for finding all occurrences of the
pattern in the text, each with at most k differences. It runs in 0(m + k?n)
time for alphabet whose size is fixed. For general input the algorithm
requires 0(mlogm + hn) time. In both cases the space requirement is O(m).

I. INTRODUCTION

In the known problem of pattern matching in strings (e.g., as discussed in [KMP]) we
are interested in finding all occurrences of a pattern in a text. In the present paper we
consider the following problem:

The string matching with k differences problem. (In short the k differences problem). Input.
Two strings: a pattern of length m and a text of length n and an integer IkaO, Find all

1) This research was supported by NSF grant NSF-DCR-8318874 and NSF-DCR-8413359, ONR grant
N0014-85-K-0046.

2) This research was supported by the Applied Mathematical Sciences subprogram of the Office of
Energy Research, U. S. Department of Energy, under contract number DE-AC02 76ER03077.

occurrences of the partem in the text with at most k differences.

Example. Let the text be abcdefghi , the pattern bxdyegh and it = 3. Let us see whether there
is an occurrence with *s k differences that starts at the second location of the text. For this
we propose the following correspondence between bcdefghi and bxdyegh. 1. b (of the text)
corresponds to b (of the pattern) 7. c to x. 3. d to d. 4. Nothing to y. 5. * to e. 6. / to
nothing. 7. g to g 8. .'; to h. The correspondence can be illustrated as follows,

b xdy e g h
bed efghi

In only three places the correspondence is between non-equal characters, implying that there
is an occurrence of the pattern at the second location of the text with 3 differences, as
required.

We distinguish three types of differences, (a) A character of the pattern corresponds to a
different character of the text. (Item 2 in the Example). In this case we say that there is a
mismatch between the two characters, (b) A character of the pattern corresponds to "no
character" in the text. (Item 4). (c) A character of the text corresponds to "no character" in
the pattern. (Item 6).

The problem of string matching with k mismatches, (in short the k mismatches problem),
was considered in [I] and [LV]. In the k mismatches problem we find all occurrences of the
pattern in the text with at most it differences of type (a). Differences of types (b) or (c) are
not allowed. [LV] give an 0(kmlog m + kn) algorithm for the problem. [I] gives a linear
time algorithm for fixed k. However, its running time grows very rapidly as function of it.

The k differences problem has a strong pragmatical flavor. In practice, we often need to
analyze situations where the data is not completely reliable. Specifically, consider a situation
where the strings which are the input for our problem contain errors and we still need to find
all possible occurrences of the pattern in the text as in reality. The errors may include a
character being replaced by another character, a character being omitted or a superfluous
character being inserted. Assuming some bound on the number of errors would clearly imply
our problems. Applications of our solution for the it differences problem in Molecular
Biology are discussed in [LVN]. [SK] give a comprehensive review of applications of the it
differences problem.

We give a new algorithm for the problem. The algorithm has two implementations.
The first runs in 0(m 2 + it 2 /!) time and 0{m 2 ) space for general input. The second runs in
0(m + ^n) time and O(m) space for alphabet whose size is fixed. For general input the
second implementation needs 0(mlogm + Jt^n) time and the same space. The algorithm is

designed for a random-access-machine (RAM) [AHU].

Our algorithm consists of a pattern analysis part to be followed by a text analysis.

In the first implementation we have achieved 0(m 2 ) time for the pattern analysis and O^n)

time for the text analysis.

In the second one we have actually achieved 0(m) time for the pattern analysis for alphabet

whose size is fixed and 0(mlogm) time for general input. While for the text analysis we

achieved O(lrn) time.

Besides its relative simplicity, our first implementation is not without merit even with respect

to the second implementation. Specifically, whenever the time for the text analysis

dominates the computation time, the running time of our first implementation is close to

optimal. There are a few realistic possibilities where this happens:

1. The same pattern has to be matched with different texts or the pattern is known in advance
and we have plenty of time to analyze it.

2. m is sufficiently small with respect to n . Specifically, m = o(kvn).

Perhaps surprisingly we were able to adopt *he conservative and simple framework of
[KMP] in the algorithm. That is, we first build 9 table based on analyst: of the pattern.
Then, we examine the text from left to right checking possible occurrences with respect to
one starting location (in the text) at each iteration. Besides the tables built in the pattern
analysis, the input to each iteration consists of the knowledge acquired in previous iterations.
The rightmost location in the text to which we arrived in a previous iteration is of particular
significance. Each iteration consists of manipulating this knowledge. And if necessary (till
this rightmost location we have not sufficient evidence to exclude the possibility of
occurrence), we proceed to investigate the text to the right of this rightmost location.
We use a further analogy to the [KMP] algorithm for the known string matching problem in
order to explain our contribution in the algorithm for the k differences problem. Consider
the following most trivial strings equality problem: Given two strings, find if they are
identical. Observing this problem and its immediate solution was clearly a step (even if it
was very minor) in devising string matching algorithms. Our presentation is strongly
motivated by this anajogy in the following sense. The presentation has two major steps.
In the first step we define an auxiliary problem which is analogous to the string equality
problem when the * differences problem is considered (instead of the string matching
problem). The algorithm for the auxiliary problem uses known techniques ([U]). Our
contribution is in providing the second major step. That is, we give an algorithm for the Jk
differences problem using the algorithm for the first step. The auxiliary problem of the first
major step is less obvious than the strings equality problem and provides an essential part of
our algorithm. However, we feel that our contribution, which is analogous to the whole

-4-

algorithm of [KMP], overcomes a more involved and general problem than in [KMP].

For the original string matching problem and the k mismatches problem, the definition
of the problems imply immediately algorithms which run in 0(jnn) time. [S] gives a simple
0(nm) time algorithm for the * differences problem. In his survey on future directions for
research in string matching, Z. Galil [G85] discusses the k mismatches problem, which seems
substantially easier than our present problem. He states that it is an open question whether
the naive algorithm for the k mismatches problem which takes 0{mn) time can be improved.
Note that the bound on the running time of our algorithm is much smaller than 0{nm). We
already noted that the case * = is the extensively studied string matching problem. There
are a few notable algorithms for the string matching problem: linear time serial algorithms -
[BM], [GS], [KMP], [KR] (a randomized algorithm) and [V], parallel algorithms - [G84] and
[V\. Note that none of these algorithms is suitable to cope with our problems.

[U85] presents an interesting algorithm for the k differences problem whose pattern
analysis takes exponential time. The algorithm runs in time 0(m\1\G + n) and requires
0((|S| + m)G) space, where |2| is the size of the alphabet and G = min(3 m , 2*|2|*m i+1 ).
Preprocessing of the pattern takes 0(m\2\G) time. Then analysis of the text takes 0(n) time
which is pretty impressive. However, the author himself seems to be aware that the space
and preprocessing time requirements make the algorithm impractical, in general. (For
comparison, the space requirement of our first implementation is 0(m 2 ) and of the second is
O(m)).

Section 2 gives an exceedingly simple algorithm for string matching with one different.
Section 3 presents the pattern analysis part of the algorithm for the * differences problem.
Section 4 presents the text analysis part. Both sections 3 and 4 discuss the first
implementation of the algorithm only. The second implementation is given in section 5.

II. PATTERN MATCHING WITH 1 DIFFERENCE IN LINEAR TIME

Below, we describe a simple algorithm for finding all occurrences of the pattern in the
text with at most one difference. [ML] showed how to apply the [KMP] algorithm for the
following slightly modified string matching problem. Find for each location of the text
whether an occurrence of the pattern starts at it. For each location in which there is no
occurrence of the pattern find the leftmost character in which there is a mismatch.
The algorithm for pattern matching with 1 difference has three steps.

1. Run this modified string matching algorithm on the input text and pattern. Suppose that no
occurrence of the pattern starts at character t, of the text. Denote by /(/') the leftmost

-5-

mismatched character of text for the pattern starting at t,. The modified string matching
algorithm would find /(Â»')â€˘

2. Inverse the text (if it was r,...,fâ€ž let it be *â€ž,*â€ž_; fj and the pattern and run the

same algorithm. Suppose that no occurrence of the pattern starts at location t, of the text.
Denote by r(i') the rightmost mismatched character of text for the pattern starting at t t . The
modified string matching algorithm would find r(i).

Recall the three types of differences as defined in the Introduction.

3. Suppose that no occurrence of the pattern starts at location f, of the text.

(i) For all i such that /(Â»') = r(i') , we can conclude that there is an occurrence of the pattern

starting at t t with one mismatch (difference of type (a)).

(ii) For all i such that /(Â«) > r(i-l), we can conclude that there is an occurrence of the

pattern starting at t, with at most one difference of type (b).

(iii) For all i such that /(/) & r(/ + l), we can conclude that there is an occurrence of the

pattern starting at r, with at most one difference of type (c).

Remarks. 1. The running time of this algorithm is 0(n + m). 2. We leave it to the
interested reader to find how to generalize this algorithm for a wider definition of this
problem where one "successive chunk of differences" is allowed.

III. STRING MATCHING WITH Jt DIFFERENCES - PATTERN ANALYSIS (ilrst
implementation).

The algorithm has two parts:

(a) The pattern analysis. We build a table which is based on analysis of the pattern.

(b) The text analysis. We show how to find efficiently all the occurrences of the pattern in
the text with at most k differences, using the result of the pattern analysis.

Analysis of the pattern

The input to the pattern analysis is the pattern, which is given as an array
A = a,, . . . ,a m . The output of the pattern analysis is a two dimensional array
MAX-LENGTH[0,...,m-lfl,...,m-l]. MAX - LENGTH (ij)=f means that

, +1 a, +/ = a y+ i Â«/+/, and a /+/+1 * Â«/+/+i- In words, consider laying the

a

suffix of the pattern starting at a l+l over the suffix of the pattern starting at ay +1 .
MAX - LENGTH (ij) is the longest match of prefixes between these two suffixes.

-6-

Define the pair (ij) ,0s ijsm-1 to be on diagonal d if i-j = d where possible
values of d are -(m-1) s d ss m-1. We compute each diagonal > . >w +i]i
where D l} is the number of differences between a l , . . . , a, and b x , . . . , b,.
It should be obvious that if D mJ s k, for at least one I, w-t s / s m Hit, then the answer to
the auxiliary problem is yes.

The following algorithm computes the matrix Dm^,p t/K+I A

Initialization D 0fi := 0.

for all I, 1 , := i ,.

for i:=l to m do

for l:= 1 to m + k do

D t y.= min (D,_ iJ +l, D /; _ : +l, D l . lJ . i if a, = b s or D,. it! ^^l otherwise).

(D S j is the minimum of three numbers. These numbers are obtained from the

predecessors of D tJ on its column, row and diagonal, rt-spectively).

This algorithm clearly runs in 0(m : ) time.

Q(km) time algorithm for the auxiliary problem. (Due to [U]).
Diagonal d of the matrix consists of all Dn's such thai lâ€”i = d.
Lemma 1 [U]. For every i,l, D,j - D^^-j, is either zero or one.
Lemma 1 implies that we can store the information of the matrix in a more compact way.
For a number of differences e and a diagonal d, let L djl denote the largest row i such that
D, , = e and D :; is on diagonal d. Note, that this implies that there are e differences

between a : a L ^ and fc,, . . . ,b L ^+^ ^d a L ^ +l 4- b L ^ +d+l .

Corollary. For our auxiliary problem we need only values of L d4 , where e and d satisfy

e â– ÂŁâ–  k and \d\ s e .

Proof, e s k is obvious. The initial values of the matrix and Lemma 1 imply that all the D t ,

on a diagonal d are a \d\ and therefore given a number of differences e we need only values

of L dt where \d\ s e.

The answer to the auxiliary problem is yes if one of the L d ., (\d\ S e as k), equals m.

Given d and e we describe how to compute L dt using its definition. That is , L dt is the

largest row such that D,j = e, and D i} is on the diagonal d. In the above 0(m 2 ) time

algorithm the assignment of e into D l} was done using one (or more) of the following data:

(a) D l - 1 j_ l (the predecessor of D tJ on the diagonal d) is e-1 and a, # b,.

(b) D;;-; (the predecessor of D tJ on row i which is also on the diagonal "below" ef) is eâ€” 1.

(c) ÂŁ>/_!; (the predecessor of D ;; on column / which is also on the diagonal "above" ,.

This implies that we can start from D tJ and follow its predecessors on diagonal d by

-8-

possibility (d) till the first time one (or more) of possibilities (a) (b) and (c) occur.
The 0(km) time algorithm for the auxiliary problem is given below. The reader is invited to
convince oneself that the initialization step (Instruction 1) is done in a way which enables the
computation of the L d4 's in the subsequent instructions. Instructions 2-6 compute L dt 's
(|rf| ÂŁ e s k) by "inversing" the above description for computing the L d y* from their
definition. L dit -\, ^-i, e -iÂ» an ^ L d+lit .. : are used to initialize the variable row (Instruction
3), which is then increased by one at a time till it hits the correct value of L d , (Instruction 4).

The 0(km) time algorithm for the auxiliary problem

 Initialization

ford:=-(k+l)to(k+l)do

L d,\d\-2 : ~ -x ;

if d<

then i rfi |rf|_i:= \d\-l;
else L^itf* -1 ;

 for e:~0 iokdu

for d:=-e to e do

 row : - max[{L d , _,+ !), (L^-J, (I* +1 ,-i+l)].
 while a^+y = b mv+l+d do

row : = row + 1 .
ÂŁ j), S[j and MAX-LENGTH do not help us
any more and we apply (the old) Instruction 4 as in the computation of the auxiliary
problem.

We finish this overview of iteration i by showing how to apply the sequence S^ and

MAX -LENGTH to obtain these jumps. The while loop of Instruction 4 looks for the longest

match between prefixes of some suffix of the text t t+row+d + u ... , where

j + 1 s i+row + d sj and some suffix of the pattern a^.^,... . We explain how to find

the maximum w such that a row+l a^^ equals t t+row+d+1 t l+raw+d + w . Suppose

that according to S,j the substring r /+TOH , +rf+1 f ;+roW+/ matches c c+1 , . . . ,a c+f for

some index c of the pattern and c +f is the maximal index of the pattern for which this match
holds. We can find those c and / using the fact that for each t h+1 (i s h < j) there exists a
triple (p^c^JJ in S tJ such that p 1 s*sp 1 + / 1 . ((p^c^J covers f A + 1 ). For the

- 10-

computation of w we need to break into the following cases:

Case(a). /2l, M AX- LENGTH (c, row) gives the maximal number g such that

a c+: a c + t equals a row+l Â«Â«*+,â–  Case ( a ) nas ^o subcases.

Cose{a\). f*g. It is easy to see that here t t+m , +d+x , '/ +row+ rf +mtaVrf) =

flrow+i "nw+mh^i and t l+row+d + min(fig) + l * o w+mh(r>f)+1 . Therefore, we assign

w:= min(f,g).

Case(a2). f=g. This implies r f+rw+d+] t l+row+d + f = a rm+i c^^ but does not

reveal whether t l+row+a+ f +1 equals a njw + f+l or not. Therefore, we assign row:= row+f and

apply again the present case analysis accumulating this "jump" over / symbols into w .

Case (ÂŁÂ»)./= 0. Case (b) has two subcases.

Case(b\). t l+rvw+d+1 * a rcw+i . Hence, we assign >v:= 0.

Case(b2). t !+royv+d+l = a, w+ i. Therefort, we assign row:= row + 1, and we apply again the

present case analysis accumulating this propagation of 1 into w.

-11-

The text analysis algorithm
;:=0;

for i: = to n-m+k do
begin

 Proper initialization (as in the 0(km) algorithm for the auxiliary problem)
 for e:=0 to k do
for d:=-e to e do

 row := mox[(L rf> ,_ 1 ,+ l), (L^^J, (I d+lf ,_i+l)].
[4. new] while i+row + d+1 s j do

[4.new.l] take from S tJ the triple that "covers" r /+rwv+rf+1 . Derive from

this triple the indices cj such that t t+mt+d+i ,f, +/w+4+/

= fl c+1 , . . . ,a c +f (f, +roH , +

1

Online LibraryGad M LandauEfficient string matching with k differences → online text (page 1 of 2)