Jiazhen Cai.

More efficient bottom-up multipattern matching in trees online

. (page 1 of 5)
Font size
QR-code for this ebook

Computer Science Department









TECHNICAL REPORT






?^v,



^>*ia>Rl^



^'V'l?:.



More Efficient Bottom-Up
Multi-Pattern Matching in Trees



j-f

o



4-1

C
C i (0 o



c
o

4-)

e
c

(C

Q-,

I •

-^^ w

Jj QJ
r-l t] .



Hoffmann and O'Donnell use an equivalent recursive definition of match sets (but restricted
to subjects without variable occurrences) to obtain an efficient bottom-up algorithm. The recursive
rules shown below add a new rule for MS{v) to Hoffmann and O'Donnell's rules so that match sets
can be defined for arbitrary patterns.

MS(v)= {v}

MS(c) = ( V } , when constant c i PF
{ V, c } , when constant c e PF

(1) MS(f(t,, .... t,)) = [f{q^ q,,) e PF I q, e MSit,),i = 1 k] u (v)

Surprisingly, this new rule is merely a formalism, since it gives rise to the exact same collection of
match sets as derived by Hoffmann and O'Donnell. This is true, because the match set MS(p) for
an arbitrary pattern p is identical to the match set MS{t) for any pattern t formed from p by replac-
ing occurrences of v in /? by occurrences of arbitrary constants that do not belong to PF.

After determining match sets for constants and variable occurrences in subject t, Hoffmann
and O'Donnell's algorithm solves the Bottom-Up Subproblem by identifying the match set for

each subpattem /(r 1 r*) of f based on the match sets for r,, i = \, ..., k. This task, which we call

the Bottom-Up Step, computes expression (1) by an Oik) time lookup in a ^-dimensional array stor-
ing transition mapT/, where X/{A/5 (f i ), ...,M5(fi)) = A/5(/(fi t^ )).

For consistency, throughout this paper we consider an instance of the Multi-Pattern Matching

Problem with paaem set P, pattern forest PF, and subject t. We also use the following parameters:

n = size of t

r = the set of match sets for P

l=\PF\

o= \MPTMp(t)\

f^max = maximum arity of any function symbol appearing in PF

In order to compute Step (1) and print the set MS(f(t\, .... t/^)) nP of patterns that match
f(t\, ..., tk) in time Oik + \ MS if it \ t^)) nP \), Hoffmann and O'Donnell preprocess the pat-
terns in P to

i. encode each pattern in PF as a distinct integer from 1 to /, and represent patterns as
trees in the obvious way (implemented in compressed form as dags);

ii. compute all match sets, and encode each such set as a distinct integer from 1 to I Fl ;

iii. compute the subset of patterns in P belonging to the /''" match set for / = 1, ..., I Fl ;

iv. compute a transition map Xy for every ^-ary function symbol / occurring in P so that

Xf(MSiti) MSitt)) = MSifUi, .... ti^)); x^ = [v], and x,. = {v, c) if c is any constant

appearing in PF, transition maps Xy are implemented as /: -dimensional arrays accessed
using integer encodings of match sets.

After preprocessing the patterns in P, Hoffmann and O'Donnell's algorithm solves the
Multi-Pattern Matching Problem by repeatedly solving Step (1) from innermost to outermost sub-
pattern of t. Their worst case time is (9(n -t- o) after preprocessing P. The array storing the transi-
tion map Xy for each *-ary function symbol / appearing in PF uses n(IFI*) space, where the



-7-

number in of match sets can be t^(2'), which is expensive in practice. Their rough bound on
preprocessing time is


1 3 4 5