Computer Science Department

TECHNICAL REPORT

Amortized Complexity of Data Structures

Rajamani Sundar

Technical Report 561

May 1991

NEW YORK UNIVERSITY


Amortized Complexity of Data Structures

Rajamani Sundar

October 1991

A dissertation in the Department of Computer Science submitted to the faculty of

the Graduate School of Arts and Science in partial fulfillment of the requirements for

the degree of Doctor of Philosophy at New York University.

Approved

(Signed)

Ravi Boppana

Research Advisor

© Copyright by Rajamani Sundar 1991

All Rights Reserved

Amortized Complexity of Data Structures

Rajamani Sundar

Advisor: Ravi Boppana

Abstract

This thesis investigates the amortized complexity of some fundamental data structure

problems and introduces interesting ideas for proving lower bounds on amortized com-

plexity and for performing amortized analysis. The problems are as follows:

• Dictionary Problem: A dictionary is a dynamic set that supports searches of ele-

ments and changes under insertions and deletions of elements. It is open whether

there exists a dictionary data structure that takes constant amortized time per

operation and uses space polynomial in the dictionary size. We prove that dictio-

nary operations require log-logarithmic amortized time under a multilevel hashing

model that is based on Yao's cell probe model.

• Splay Algorithm's Analysis: Splay is a simple, efficient algorithm for searching

binary search trees, devised by Sleator and Tarjan, that uses rotations to reorganize

the tree. Tarjan conjectured that Splay takes linear time to process deque operation

sequences on a binary tree and proved a special case of this conjecture called the

Scanning Theorem. We prove tight bounds on the maximum numbers of various

types of right rotations in a sequence of right rotations performed on a binary

tree. One of the lower bounds refutes a conjecture of Sleator. We apply the upper

bounds to obtain a nearly linear upper bound for Tarjan's conjecture. We give two

new proofs of the Scanning Theorem, one of which is a potential-based proof that

solves a problem of Tarjan.

• Set Equality Problem: The task of maintaining a dynamic collection of sets under

various operations arises in many applications. We devise a fast data structure

for maintaining sets under equality-tests and under creations of new sets through

insertions and deletions of elements. Equality-tests require constant time and set-

creations require logarithmic amortized time. This improves previous solutions.

Acknowledgements

It is a pleasure to acknowledge the help of many people in performing the work of this

thesis.

I would like to thank my advisor Ravi Boppana for teaching me many useful research

skills and teaching/presentation skills. I learnt from him the importance of simple and

clear formulation of ideas, and about several useful ideas in Theoretical Computer Science

and Discrete Mathematics. The work on the Dictionary Problem owes a great deal to

the various things I learnt from him and to his help. He showed me the initial directions

to proceed along, he came up with fresh approaches to pursue whenever I failed, and he

kept encouraging me to succeed.

I would like to thank my co-advisor Richard Cole for initiating me into the area of

amortized analysis, for always being a source of inspiration, and for always giving me

kind attention and advice when I needed them. The work on the Deque Conjecture owes

greatly to the several fruitful discussions I had with him.

I would like to thank Mike Fredman, Bud Mishra, and Bob Tarjan for serving on my

thesis defense committee and for their role in this work. Mike Fredman and Bob Tarjan

supported and guided me during the summer of 1989 when I was a student under them

at Bellcore and at Bell labs, respectively. They were always sources of nice problems to

work on, and of inspiration and encouragement. The work on the Set Equality Problem

is part of a larger joint work done with Bob Tarjan. Bud Mishra supported me during

the academic year 1988-89, and was always a source of encouragement and help.

I would like to thank the Courant Institute community for providing a great envi-

ronment for this work.

Finally, I would like to thank IBM Corporation for graciously supporting me during

the academic years 1989-91.

Contents

1 Introduction 1

1.1 Amortized Complexity 1

1.2 Our Work 3

2 The Dictionary Problem 5

2.1 Introduction 5

2.2 Single-level Hashing Model 9

2.2.1 Uniform Hash Functions and Worst-case Complexity 11

2.2.2 Nonuniform Hash Functions 12

2.2.3 Amortization 13

2.3 Multilevel Hashing Model 13

2.3.1 Partial Hashing Model 15

2.3.2 Adversary 16

2.3.3 Two Random Sampling Lemmas 21

2.3.4 The Worst-case Lower Bound 27

2.3.5 Amortization 30

3 The Deque Conjecture 35

3.1 Introduction 35

3.1.1 The Splay Algorithm and Its Conjectures 35

3.1.2 Terminology 39

3.1.3 Previous Works 41

3.1.4 Our Results 41

3.2 Counting Twists, Turns, and Cascades 43

3.2.1 Upper Bounds 43

3.2.2 Lower Bounds 51

3.3 An Upper Bound for the Deque Conjecture 62

3.4 New Proofs of the Scanning Theorem 65

3.4.1 A Potential-based Proof 66

3.4.2 An Inductive Proof 68

4 Testing Set Equality 75

4.1 Introduction 75

4.2 The Data Structure 78

4.3 The Analysis 79

4.4 Directions for Further Work 83

Bibliography 85

Chapter 1

Introduction

1.1 Amortized Complexity

In many applications of data structures, the data structure is embedded within some

algorithm that performs successive operations on it. In these applications, we are inter-

ested only in the time taken by the data structure to process operation sequences as a

whole and not in the time spent on isolated operations. Amortized data structures are

data structures tailored to such applications: these data structures may perform poorly

on a few individual operations but perform very well on all operation sequences. The

natural performance measure for an amortized data structure is its amortized complexity,

defined to be the maximum cost of operation sequences performed on the data structure

as a function of the lengths of the sequences. Amortized data structures are appealing

because they dispense with complicated constraints and associated information present

in data structures that achieve a fast performance on all operations and they use simple

reorganizing heuristics, instead, to achieve a fast amortized performance. Some exam-

ples of these data structures are the compressed tree data structures for the Union-find

Problem [1,23,32], the Splay Tree [23,28,32], and the Pairing Heap [16].

Amortized data structures are simple to describe but their performance analysis is

often quite involved. Since operation sequences on these data structures are mixtures

of operations of varying costs that finely interact with one another, it is tricky to

accurately estimate their amortized complexity. Of the three amortized data structures

mentioned above only the first one has been analyzed thoroughly; even its analysis was

accomplished only several years after the data structure was originally conceived. The

complete analysis of the other two is still open.

A useful framework for performing amortized analysis involves defining an appro-

priate potential function for the data structure [34]. In this framework, each state of

the data structure is assigned a real number called its potential. The amortized cost of

a data structure operation is defined to be the actual cost incurred by that operation

plus the increase in potential it causes through change of state. The total cost of an

operation sequence performed on the data structure equals the total amortized cost of

the operations in the sequence plus the decrease in potential from the initial state to the

state at the end of the operation sequence. Choosing a suitable potential function that

yields a sharp estimate on the amortized complexity is a task that demands ingenuity.
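In symbols, with states D_0, ..., D_m, actual costs c_i, and a potential function Φ, the framework reads as follows (a standard rendering of the definitions above, not notation taken verbatim from this chapter):

```latex
% Amortized cost of the i-th operation: actual cost plus potential increase.
\hat{c}_i = c_i + \Phi(D_i) - \Phi(D_{i-1})

% Summing over the sequence, the potential terms telescope, so the total
% actual cost is the total amortized cost plus the net potential decrease.
\sum_{i=1}^{m} c_i = \sum_{i=1}^{m} \hat{c}_i + \Phi(D_0) - \Phi(D_m)
```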

We illustrate the potential framework through a simple example. A priority queue

with attrition (PQA) is a dynamic set of real numbers supporting two operations:

Deletemin deletes and returns the smallest element of the set; Insert(x) deletes all

elements of the set greater than x and adds x to the set. A PQA can be represented

by a sorted list of its elements. In this representation, Deletemin takes constant time,

and Insert takes time proportional to the number of deleted elements. If we define the

potential of the data structure to be the number of elements it contains, then PQA

operations take constant amortized time; PQA operation sequences take linear time.
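A minimal executable sketch of this representation (hypothetical class and method names; the thesis treats the structure abstractly):

```python
from collections import deque

class PQA:
    """Priority queue with attrition, kept as a sorted (ascending) list.

    A deque keeps both the Deletemin end and the Insert end cheap:
    each element touched costs constant actual time.
    """

    def __init__(self):
        self.items = deque()  # sorted in increasing order

    def deletemin(self):
        """Delete and return the smallest element: constant actual time."""
        return self.items.popleft()

    def insert(self, x):
        """Delete every element greater than x, then add x.

        Actual cost is proportional to the number of deletions; with
        potential = number of stored elements, each deletion is paid for
        by a unit drop in potential, so the amortized cost is constant.
        """
        while self.items and self.items[-1] > x:
            self.items.pop()
        self.items.append(x)
```

For example, inserting 3, 7, 5 in that order leaves the list (3, 5), since 7 is attrited by the insertion of 5.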

The notions of amortized complexity and amortized cost are usually used in a much

wider sense than defined above. Amortized complexity is usually a maximizing function

of several parameters of the operation sequences, instead of their length alone. The

amortized costs of operations are usually a set of functions of the operation sequence

parameters that, when added together, yield a good estimate of the amortized complex-

ity. For example, the compressed tree data structures for the Union-find Problem have

amortized complexity equal to O(n + mα(m + n, n)), and take constant time on Union

operations and O(α(m + n, n)) amortized time on Find operations, where α(m, n) is an

inverse function of the Ackermann function, and m and n respectively denote the total

number of Finds and the total number of Unions in the operation sequence [23,32].
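For concreteness, here is a compressed-tree sketch in Python (union by rank with path compression, a standard rendering of the cited family of structures rather than code from [23,32]):

```python
class UnionFind:
    """Compressed-tree data structure for the Union-find Problem."""

    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.rank = [0] * n           # rank bounds the tree height

    def find(self, x):
        # Locate the root, then compress: every node on the find path is
        # pointed directly at the root, which is what makes later finds
        # cheap in the amortized sense.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # Link the lower-rank root under the higher-rank one.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```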

Let us relate amortized complexity to other measures of data structure performance.

There exist data structure problems whose amortized complexities are lower than their

worst-case complexities; for instance, the Union-find Problem is solvable in nearly con-

stant amortized time per operation but requires nearly logarithmic worst-case time per

operation [7,15,32]. There probably exist problems whose randomized (or average-case)

complexities are lower than their amortized complexities, but this is yet to be proven; for

instance, the Dictionary Problem and certain other problems that involve testing equal-

ity of objects appear to be good candidates for this class. The amortized complexity of

a dynamic data structure problem is often intimately related to the complexity of its

static version via transformations of solutions/adversaries of one problem into another

[6,39].

An appropriate model for proving lower bounds on the amortized complexity of a

data structure problem is Yao's cell probe model [38]. This model abstracts out the

arithmetic and indexing capabilities of random access machines without ignoring their

word-size limitations. The model has a memory consisting of an array of b-bit locations.

Data structure operations are performed in a series of memory probes in which the next

probe location is always computed as a general function of the values so far read. In this

model, Fredman and Saks [15] proved tight lower bounds on the amortized complexity

of many problems, including the Union-find Problem. The only other interesting lower

bound known in this model is for a static data structure problem, due to Ajtai [3]. The

complexity of many other problems, notably, the Dictionary Problem and the Priority

Queue Problem, remains unexplored.

This completes our introduction to amortized complexity. Further information on

this subject can be found in [23,32,34].

1.2 Our Work

This thesis investigates the amortized complexity of some fundamental data structure

problems. We introduce interesting ideas for proving lower bounds on amortized com-

plexity and for performing amortized analysis that enable progress towards resolving

some open questions. The problems studied are as follows.

In Chapter 2, we study the amortized complexity of the Dictionary Problem. A dic-

tionary is a dynamic set that supports searches, insertions, and deletions of elements. It

is an open problem whether a dictionary can be maintained in constant amortized time

per operation using space polynomial in the dictionary size; we denote the dictionary

size by n. While hashing schemes solve the problem in constant amortized time per

operation on the average or using randomness, the best deterministic solution (which uses

hashing) takes O(log n / log log n) amortized time per operation. We study the deterministic

complexity of the dictionary problem under a multilevel hashing model that is based on

Yao's cell probe model, and prove that dictionary operations require Ω(log log n) amor-

tized time in this model. Our model encompasses many known solutions to the dictionary

problem, and our result is the first nontrivial lower bound for the problem in a reason-

ably general model that takes into account the limited wordsize of memory locations

and realistically measures the cost of update operations. This lower bound separates the

deterministic and randomized complexities of the problem under this model.

In Chapter 3, we study a problem arising in the analysis of Splay, a rotation-based al-

gorithm for searching binary search trees that was devised by Sleator and Tarjan. Tarjan

proved that Splay takes linear time to scan the nodes of a binary tree in symmetric order;

this result is called the Scanning Theorem. More generally, he conjectured that Splay

takes linear time to process deque operation sequences on a binary tree; this conjecture is

called the Deque Conjecture. We prove that Splay takes linear-times-inverse-Ackermann

time to process deque operation sequences. In the process, we obtain tight bounds on

some interesting combinatorial problems involving rotation sequences on binary trees.

These problems arose from studying a conjecture of Sleator that we refute. We give two

new proofs of the Scanning Theorem. One proof is a potential-based proof; Tarjan had

originally posed the problem of finding such a proof. The other proof uses induction.

In Chapter 4, we study the problem of maintaining a dynamic collection of sets under

equality-tests of two sets and under creations of new sets through insertions and deletions

of elements. We devise a data structure that takes constant time on equality-tests and

logarithmic amortized time on set-creations. The data structure derandomizes a previous

randomized data structure, due to Pugh and Teitelbaum, that took logarithmic expected

amortized time on set-creations.

Some of our work has been published before. The work on the Deque Conjecture was

reported in [30]. The work on the Set Equality Problem was reported in a joint paper

with Robert E. Tarjan [31] that also dealt with other related problems.

Chapter 2

The Dictionary Problem

The Dictionary Problem is a classical data structure problem arising in many applications

such as symbol table management in compilers and language processors. The problem

is to maintain a dynamic set under the operations of search, insertion, and deletion of

elements. The problem can be solved in constant time per operation using a bit vector

that has a separate bit for each element of the universe indicating its presence in the

set. This simple solution is unsatisfactory since it uses space proportional to the universe

size and the dictionary is usually very small compared to the universe. Does there exist a

constant-amortized-time solution that uses only space polynomial in the dictionary size?
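The bit-vector solution just described can be sketched in a few lines of Python (identifying U with {0, 1, ..., |U| - 1}; names are illustrative):

```python
class BitVectorDictionary:
    """Dictionary over U = {0, ..., universe_size - 1}, one flag per element.

    Every operation is constant time, but the space is proportional to
    the universe size rather than to the dictionary size.
    """

    def __init__(self, universe_size):
        self.present = bytearray(universe_size)  # 0/1 flag per element of U

    def search(self, x):
        return self.present[x] == 1

    def insert(self, x):
        self.present[x] = 1

    def delete(self, x):
        self.present[x] = 0
```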

In this chapter, we study the amortized complexity of the dictionary problem under

a multilevel hashing model that is based on Yao's cell probe model, and prove that

dictionary operations require log-logarithmic amortized time in this model.

2.1 Introduction

The Dictionary Problem is to maintain a dynamic set, called a dictionary, over a finite,

ordered universe U under the following operations:

• Search(x): Determine whether element x is in the dictionary; if so, find a memory

location storing x.

• Insert(x): Add element x to the dictionary.

• Delete(x): Remove element x from the dictionary.

The dictionary is initially the empty set. We would like to know how fast the problem

can be solved using only space polynomial in the dictionary size (denoted by n) on a

RAM with (log|U|)-bit memory words. Here we are restricting the space used in terms

of the dictionary size in order to exclude the trivial bit vector solution and a family of

its generalizations, due to Willard [37], that process dictionary operations in constant

time and use space depending on the size of the universe.

The Dictionary Problem has been extensively studied and has a rich variety of solu-

tions.

Balanced search trees solve the problem in O(log n) time per operation and use O(n)

space [1,21,23,32]. Self-adjusting search trees such as Sleator and Tarjan's Splay Tree

[28] and Galperin and Rivest's Scapegoat Tree []x] take O(log n) amortized time per

operation. There is no hope of improving the log n behaviour of search trees since they

use only comparisons to perform searches.

Hashing schemes solve the problem in constant expected time per operation and

use O(n) space. Uniform hashing performs dictionary operations in constant expected

time when the operations are randomly chosen from a uniform distribution [21,1]; the

space used is 0(n). Universal hashing is an improved randomized hashing scheme that

performs searches in constant expected time and performs updates in constant expected

amortized time [8]; the expectation, here, is over the random hash functions chosen

by the algorithm and the bounds apply to all possible operation sequences. Dynamic

perfect hashing [13] (see also [2,14]) is a further improvement that performs searches in

constant worst-case time and performs updates in constant expected amortized time.

All of the above hashing schemes fall under the general category of multilevel hashing

schemes. Roughly speaking, a multilevel hashing scheme consists of a hierarchically

organized system of hash functions that successively partition the dictionary into finer

pieces until all elements in the dictionary have been separated, plus an algorithm to

update the configuration after each dictionary operation.
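The partitioning idea can be illustrated with a toy sketch (random linear hash functions, recursing until singleton buckets; this illustrates only the hierarchy, not any particular scheme cited here):

```python
import random

def multilevel_partition(elements, table_size, depth=0, max_depth=32):
    """Partition a set of integer keys with successive random hash
    functions until every element sits alone in a bucket.

    Returns a nested dict whose internal nodes map a bucket index to a
    finer partition and whose leaves are single elements.
    """
    if len(elements) <= 1:
        return elements[0] if elements else None
    if depth >= max_depth:
        raise RuntimeError("unlucky hash choices; choose new functions")
    p = (1 << 61) - 1  # a prime larger than any key used here
    a, b = random.randrange(1, p), random.randrange(p)
    buckets = {}
    for x in elements:
        buckets.setdefault(((a * x + b) % p) % table_size, []).append(x)
    # One more level of the hierarchy for every bucket that still holds
    # more than one element.
    return {i: multilevel_partition(xs, table_size, depth + 1, max_depth)
            for i, xs in buckets.items()}
```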

The fastest deterministic solution to the problem, at present, is a multilevel hashing

scheme, due to Fredman and Willard [17], that takes O(log n / log log n) amortized time

per operation and uses O(n) space.

It has been possible to prove nonconstant lower bounds for the problem on certain de-

terministic hashing models. Dietzfelbinger et al. [13] showed a tight bound of Θ(log n) on

the amortized complexity of dictionary operations under a wordsize-independent multi-

level hashing model that does not include search trees. Mehlhorn et al. [24] strengthened

this model so that it includes search trees and showed that dictionary operations have

Θ(log log n) amortized complexity under their model. These models assume that mem-

ory locations have an unbounded wordsize and overestimate the costs of operations; this

simplifies the task of proving lower bounds but the models are not realistic. If memory

locations have a sufficiently large wordsize, the problem is solvable in constant time per

operation (Paul and Simon [25]; Ajtai et al. [4]). When memory locations can only rep-

resent a single element of the universe, however, the best available solution is Fredman

and Willard's O(log n / log log n)-time solution [17].

We prove a nonconstant lower bound for the Dictionary Problem under a multilevel

hashing model, based on Yao's cell probe model, that takes into account the limited

wordsize of memory locations and realistically measures the costs of operations.

We define a generic multilevel hashing model for solving the Dictionary Problem

from which various lower bound models for the problem can be derived by specifying

suitable cost measures for the operations. The model has a vertically organized memory

that consists of a root location at level 1 and an array of at most m locations at level i,

for each i ≥ 2. Memory locations have b bits each, for some b ≥ log|U|. Each memory

location stores some c different elements of the universe plus a (b - c log|U|)-bit value

that guides search operations; here the number of elements stored at a location varies

over time, and is not constant. A search operation locates an element in memory by

performing a sequence of memory probes: the first probe of the sequence always reads

the root location and the i-th probe, for i ≥ 2, reads a location at level i that is com-

puted as a general function of the sequence of i - 1 values so far read and the element

being searched. The operation either locates the element in some location after a series

of probes or concludes that the element is not in the dictionary after a series of unsuc-

cessful probes. A search operation finally replaces the current memory configuration by

a new configuration, representing the same dictionary, by computing a general function

of the current configuration. An update operation simply replaces the current memory

configuration by a new configuration that represents the new dictionary by computing a

general function of the current configuration.

We define a lower bound model for the problem by imposing a measure of operation

costs on the generic model. The Hamming cost of an operation is defined as the maximum

number of locations at any level whose contents change due to the operation. The cost

of a search operation is defined as the number of read probes performed by the operation

(called the search cost of the operation) plus the Hamming cost of the operation. The

cost of an update operation is defined as its Hamming cost. When this cost measure

is imposed on the generic model we get a lower bound model called the Hamming cost

model.

We prove our lower bound in the following model that has a different cost measure

from the Hamming cost model. The search path of an element x of U in a memory

configuration C is defined as the sequence of locations probed by a search operation on

x performed in configuration C. We say that an operation refreshes a memory location

l if there is some element x in the final dictionary such that l lies on the search path

of x after the operation but l did not lie on the search path of x before the operation.

The refresh cost of an operation is defined as the maximum number of locations at any

level that get refreshed by the operation. Define the cost of an operation as the sum of

the search cost and the refresh cost of the operation. The lower bound model with this

cost measure is called the refresh cost model. The difference between this model and

the Hamming cost model is that the refresh cost measure is sensitive to locations that

participate in rerouting the search paths of dictionary elements during an operation;

the Hamming cost measure, on the other hand, is sensitive to locations whose contents

change.

A nonconstant lower bound for the Dictionary Problem in the Hamming cost model

would translate into a similar lower bound for the problem in the cell probe model. We

believe that such a lower bound exists, but we can only prove a lower bound under

the refresh cost model. The refresh cost model seems to be incomparable in power to

the cell probe model and to the Hamming cost model. The justification for the refresh

cost model is that, in realistic hashing schemes, an operation might have to read, and,

if necessary, modify, the locations it refreshes in order to correctly reroute the search

paths of dictionary elements. The refresh cost model encompasses many of the known

solutions to the dictionary problem, but not all of them; for instance, the model includes

B-trees, red-black trees, and various hashing schemes, but the class of rotation-based

search trees is not included.

Our lower bound is given by:

Theorem 2.1 Consider the refresh cost multilevel hashing model with at most m mem-

ory locations per level, a universe U, and a wordsize b. Let n be a positive integer

satisfying |U| ≥ m^{log log n}. Consider dictionary operation sequences during which the

TECHNICAL REPORT

Amortized Complexity of Data Structures

Rajamani Sundar

Technical Report 561

May 1991

NEW YORK UNIVERSITY

JVC

I C

IK m

|E^ e

(0

â€¢1â€”1

|(_) (0

W OS

\(^

Is ^

O u

in c

CO

>1

â– H

a; CO

e a

o +J

o o

a

'O u

MT ttmuTum usmar

Amortized Complexity of Data Structures

Rajamani Sundar

October 1991

A dissertation in the Department of Computer Science submitted to the faculty of

the Graduate School of Arts and Science in partial fulfillment of the requirements for

the degree of Doctor of Philosophy at New York University.

Approved

(Signed)

Ravi Boppana

Research Advisor

Â© Copyright by Rajamani Sundar 1991

All Rights Reserved

Amortized Complexity of Data Structures

Rajamani Sundar

Advisor: Ravi Boppana

Abstract

This thesis investigates the amortized complexity of some fundamental data structure

problems and introduces interesting ideas for proving lower bounds on amortized com-

plexity and for performing amortized analysis. The problems are as follows:

â€¢ Dictionary Problem: A dictionary is a dynamic set that supports searches of ele-

ments and changes under insertions and deletions of elements. It is open whether

there exists a dictionary data structure that takes constant amortized time per

operation and uses space polynomial in the dictionary size. We prove that dictio-

nary operations require log-logarithmic amortized time under a multilevel hashing

model that is based on Yao's cell probe model.

â€¢ Splay Algorithm's Analysis: Splay is a simple, efficient algorithm for searching

binary search trees, devised by Sleator and Tarjan, that uses rotations to reorganize

the tree. Tarjan conjectured that Splay takes linear time to process deque operation

sequences on a binary tree and proved a special case of this conjecture called the

Scanning Theorem. We prove tight bounds on the maximum numbers of various

types of right rotations in a sequence of right rotations performed on a binary

tree. One of the lower bounds refutes a conjecture of Sleator. We apply the upper

bounds to obtain a nearly linear upper bound for Tarjan's conjecture. We give two

new proofs of the Scanning Theorem, one of which is a potential-based proof that

solves a problem of Tarjan.

â€¢ Set Equality Problem: The task of maintaining a dynamic collection of sets under

various operations arises in many applications. We devise a fast data structure

for maintaining sets under equality-tests and under creations of new sets through

insertions and deletions of elements. Equality-tests require constant time and set-

creations require logarithmic amortized time. This improves previous solutions.

Acknowledgements

It is a pleasure to acknowledge the help of many people in performing the work of this

thesis.

I would like to thank my advisor Ravi Boppana for teaching me many useful research

skills and teaching/presentation skills. I learnt from him the importance of simple and

clear formulation of ideas, and about several useful ideas in Theoretical Computer Science

and Discrete Mathematics. The work on the Dictionary Problem owes a great deal to

the various things I learnt from him and to his help. He showed me the initial directions

to proceed along, he came up with fresh approaches to pursue whenever I failed, and he

kept encouraging me to succeed.

I would like to thank my co-advisor Richard Cole for initiating me into the area of

amortized analysis, for always being a source of inspiration, and for always giving me

kind attention and advice when I needed them. The work on the Deque Conjecture owes

greatly to the several fruitful discussions I had with him.

I would like to thank Mike Fredman, Bud Mishra, and Bob Tarjan for serving on my

thesis defense committee and for their role in this work. Mike Fredman and Bob Tarjan

supported and guided me during the summer of 1989 when I was a student under them

at Bellcore and at Bell labs, respectively. They were always sources of nice problems to

work on, and of inspiration and encouragement. The work on the Set Equality Problem

is part of a larger joint-work done with Bob Tarjan. Bud Mishra supported me during

the academic year 1988-89, and was always a source of encouragement and help.

1 would like to thank the Courant Institute community for providing a great envi-

ronment for this work.

Finally, I would like to thank IBM Corporation for graciously supporting me during

the academic vears 1989-91.

Contents

1 Introduction 1

1.1 Amortized Complexity 1

1.2 Our Work 3

2 The Dictionary Problem 5

2.1 Introduction 5

2.2 Single-level Hashing Model 9

2.2.1 Uniform Hash Functions and Worst-case Complexity 11

2.2.2 Nonuniform Hash Functions 12

2.2.3 Amortization 13

2.3 Multilevel Hashing Model 13

2.3.1 Partial Hashing Model 15

2.3.2 Adversary 16

2.3.3 Two Random Sampling Lemmas 21

2.3.4 The Worst-case Lower Bound 27

2.3.5 Amortization 30

3 The Deque Conjecture 35

3.1 Introduction 35

3.1.1 The Splay Algorithm and Its Conjectures 35

3.1.2 Terminology 39

ill

3.1.3 Previous Works 41

3.1.4 Our Results 41

3.2 Counting Twists, Turns, and Cascades 43

3.2.1 Upper Bounds 43

3.2.2 Lower Bounds 51

3.3 An Upper Bound for the Deque Conjecture 62

3.4 New Proofs of the Scanning Theorem 65

3.4.1 A Potential-based Proof 66

3.4.2 An Inductive Proof 68

4 Testing Set Equality 75

4.1 Introduction 75

4.2 The Data Structure 78

4.3 The Analysis 79

4.4 Directions for Further Work 83

Bibliography 85

IV

Chapter 1

Introduction

1.1 Amortized Complexity

In many applications of data structures, the data structure is embedded within some

algorithm that performs successive operations on it. In these applications, we are inter-

ested only in the time taken by the data structure to process operation sequences as a

whole and not in the time spent on isolated operations. Amortized data structures are

data structures tailored to such applications: these data structures may perform poorly

on a few individual operations but perform very well on all operation sequences. The

natural performance measure for an amortized data structure is its amortized complexity,

defined to be the maximum cost of operation sequences performed on the data structure

as a function of the lengths of the sequences. Amortized data structures are appealing

because they dispense with complicated constraints and associated information present

in data structures that achieve a fast performance on all operations and they use simple

reorganizing heuristics, instead, to achieve a fast amortized performance. Some exam-

ples of these data structures are the compressed tree data structures for the Union-find

Problem [1,23,32], the Splay Tree [23,28,32], and the Pairing Heap [16].

Amortized data structures are simple to describe but their performance analysis is

often quite Involved. Since operation sequences on these data structures are mixtures

of operations of varying costs that very finely interact with one another it is tricky to

accurately estimate their amortized complexity. Of the three amortized data structures

mentioned above only the first one has been analyzed thoroughly; even its analysis was

accomplished only several years after the data structure was originally conceived. The

complete analysis of the other two is still open.

A useful framework for performing amortized analysis involves defining an appro-

priate potential function for the data structure [34]. In this framework, each state of

the data structure is assigned a real number called its potential. The amortized cost of

a data structure operation is defined to be the actual cost incurred by that operation

plus the increase in potential it causes through change of state. The total cost of an

operation sequence performed on the data structure equals the total amortized cost of

the operations in the sequence plus the decrease in potential from the initial state to the

state at the end of the operation sequence. Choosing a suitable potential function that

yields a sharp estimate on the amortized complexity is a task that demands ingenuity.
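In symbols, with c_i the actual cost of the i-th operation, a_i its amortized cost, and \Phi_i the potential after it (notation ours, following the description above):

```latex
% Potential-based amortized analysis as described in the text.
% c_i: actual cost of operation i, a_i: its amortized cost,
% \Phi_i: potential after operation i (\Phi_0: initial potential).
a_i \;=\; c_i \;+\; \Phi_i - \Phi_{i-1},
\qquad
\sum_{i=1}^{k} c_i \;=\; \sum_{i=1}^{k} a_i \;+\; \Phi_0 - \Phi_k .
```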

We illustrate the potential framework through a simple example. A priority queue

with attrition (PQA) is a dynamic set of real numbers supporting two operations:

Deletemin deletes and returns the smallest element of the set; Insert(x) deletes all

elements of the set greater than x and adds x to the set. A PQA can be represented

by a sorted list of its elements. In this representation, Deletemin takes constant time,

and Insert takes time proportional to the number of deleted elements. If we define the

potential of the data structure to be the number of elements it contains, then PQA

operations take constant amortized time; PQA operation sequences take linear time.
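The sorted-list PQA just described can be sketched as follows (the class and method names are ours, for illustration):

```python
# A priority queue with attrition (PQA) kept as a sorted list.
# Potential = number of stored elements; each Insert's deletions are
# paid for by the potential drop they cause, so the amortized cost
# of every operation is O(1).

class PQA:
    def __init__(self):
        self.items = []  # sorted in increasing order

    def insert(self, x):
        # Delete all elements greater than x, then add x.
        # Actual cost ~ number of deletions, matched by the potential drop.
        while self.items and self.items[-1] > x:
            self.items.pop()
        self.items.append(x)

    def deletemin(self):
        # Smallest element sits at the front.
        return self.items.pop(0)
```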

The notions of amortized complexity and amortized cost are usually used in a much

wider sense than defined above. Amortized complexity is usually a maximizing function

of several parameters of the operation sequences, instead of their length alone. The

amortized costs of operations are usually a set of functions of the operation sequence

parameters that, when added together, yield a good estimate of the amortized complex-

ity. For example, the compressed tree data structures for the Union-find Problem have

amortized complexity equal to O(n + m·α(m + n, n)), and take constant time on Union

operations and O(α(m + n, n)) amortized time on Find operations, where α(m, n) is an

inverse function of the Ackermann function, and m and n respectively denote the total

number of Finds and the total number of Unions in the operation sequence [23,32].
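For concreteness, a compressed-tree structure of the kind cited above can be sketched as follows (path compression with union by rank; the code is illustrative, not the presentation used in the references):

```python
# Union-find via compressed trees: union by rank plus path
# compression, the combination behind the O(n + m·α(m+n, n)) bound.

def make_sets(n):
    # Each element starts as a singleton tree of rank 0.
    return list(range(n)), [0] * n  # (parent, rank)

def find(parent, x):
    # Locate the root, then compress: point every node on the
    # search path directly at the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(parent, rank, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx == ry:
        return
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx
    parent[ry] = rx  # attach the lower-rank root under the higher
    if rank[rx] == rank[ry]:
        rank[rx] += 1
```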

Let us relate amortized complexity to other measures of data structure performance.

There exist data structure problems whose amortized complexities are lower than their

worst-case complexities; for instance, the Union-find Problem is solvable in nearly con-

stant amortized time per operation but requires nearly logarithmic worst-case time per

operation [7,15,32]. There probably exist problems whose randomized (or average-case)

complexities are lower than their amortized complexities, but this is yet to be proven; for

instance, the Dictionary Problem and certain other problems that involve testing equal-

ity of objects appear to be good candidates for this class. The amortized complexity of

a dynamic data structure problem is often intimately related to the complexity of its

static version via transformations of solutions/adversaries of one problem into another

[6,39].

An appropriate model for proving lower bounds on the amortized complexity of a

data structure problem is Yao's cell probe model [38]. This model abstracts out the

arithmetic and indexing capabilities of random access machines without ignoring their

word-size limitations. The model has a memory consisting of an array of b-bit locations.

Data structure operations are performed in a series of memory probes in which the next

probe location is always computed as a general function of the values so far read. In this

model, Fredman and Saks [15] proved tight lower bounds on the amortized complexity

of many problems, including the Union-find Problem. The only other interesting lower

bound known in this model is for a static data structure problem, due to Ajtai [3]. The

complexity of many other problems, notably, the Dictionary Problem and the Priority

Queue Problem, remains unexplored.

This completes our introduction to amortized complexity. Further information on

this subject can be found in [23,32,34].

1.2 Our Work

This thesis investigates the amortized complexity of some fundamental data structure

problems. We introduce interesting ideas for proving lower bounds on amortized com-

plexity and for performing amortized analysis that enable progress towards resolving

some open questions. The problems studied are as follows.

In Chapter 2, we study the amortized complexity of the Dictionary Problem. A dic-

tionary is a dynamic set that supports searches, insertions, and deletions of elements. It

is an open problem whether a dictionary can be maintained in constant amortized time

per operation using space polynomial in the dictionary size; we denote the dictionary

size by n. While hashing schemes solve the problem in constant amortized time per oper-

ation on the average or using randomness, the best deterministic solution (which uses

hashing) takes O(log n / log log n) amortized time per operation. We study the deterministic

complexity of the dictionary problem under a multilevel hashing model that is based on

Yao's cell probe model, and prove that dictionary operations require Ω(log log n) amor-

tized time in this model. Our model encompasses many known solutions to the dictionary

problem, and our result is the first nontrivial lower bound for the problem in a reason-

ably general model that takes into account the limited wordsize of memory locations

and realistically measures the cost of update operations. This lower bound separates the

deterministic and randomized complexities of the problem under this model.

In Chapter 3, we study a problem arising in the analysis of Splay, a rotation-based al-

gorithm for searching binary search trees that was devised by Sleator and Tarjan. Tarjan

proved that Splay takes linear time to scan the nodes of a binary tree in symmetric order;

this result is called the Scanning Theorem. More generally, he conjectured that Splay

takes linear time to process deque operation sequences on a binary tree; this conjecture is

called the Deque Conjecture. We prove that Splay takes linear-times-inverse-Ackermann

time to process deque operation sequences. In the process, we obtain tight bounds on

some interesting combinatorial problems involving rotation sequences on binary trees.

These problems arose from studying a conjecture of Sleator that we refute. We give two

new proofs of the Scanning Theorem. One proof is a potential-based proof; Tarjan had

originally posed the problem of finding such a proof. The other proof uses induction.
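Splay is built from rotations; the primitive in question, a single right rotation, can be sketched as follows (the Node class is illustrative, not the thesis's code):

```python
# A right rotation at node y: y's left child x moves up to become the
# subtree root, with x's right subtree handed to y. Symmetric order of
# the keys is preserved, which is what lets Splay reorganize the tree
# freely while remaining a binary search tree.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y  # x's right subtree becomes y's left
    return x  # new root of the rotated subtree

def inorder(t):
    # Symmetric-order key list, to check that rotations preserve it.
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)
```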

In Chapter 4, we study the problem of maintaining a dynamic collection of sets under

equality-tests of two sets and under creations of new sets through insertions and deletions

of elements. We devise a data structure that takes constant time on equality-tests and

logarithmic amortized time on set-creations. The data structure derandomizes a previous

randomized data structure, due to Pugh and Teitelbaum, that took logarithmic expected

amortized time on set-creations.
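The principle behind constant-time equality tests can be sketched by keeping one canonical copy of each distinct set, so that an equality test is a single pointer comparison. This is only a static illustration of the idea; the structure of Chapter 4 maintains canonical forms incrementally under insertions and deletions:

```python
# Canonical (interned) set representations: two sets are equal iff
# their canonical copies are the same object. Set-creation pays the
# normalization cost; equality-tests take constant time.

_canon = {}

def make_set(elements):
    key = tuple(sorted(set(elements)))
    # setdefault guarantees one shared copy per distinct set.
    return _canon.setdefault(key, key)

def equal(s, t):
    return s is t  # constant-time identity comparison
```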

Some of our work has been published before. The work on the Deque Conjecture was

reported in [30]. The work on the Set Equality Problem was reported in a joint paper

with Robert E. Tarjan [31] that also dealt with other related problems.

Chapter 2

The Dictionary Problem

The Dictionary Problem is a classical data structure problem arising in many applications

such as symbol table management in compilers and language processors. The problem

is to maintain a dynamic set under the operations of search, insertion, and deletion of

elements. The problem can be solved in constant time per operation using a bit vector

that has a separate bit for each element of the universe indicating its presence in the

set. This simple solution is unsatisfactory since it uses space proportional to the universe

size and the dictionary is usually very small compared to the universe. Does there exist a

constant-amortized-time solution that uses only space polynomial in the dictionary size?
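The bit-vector solution just described can be sketched as follows:

```python
# The trivial bit-vector dictionary: one bit per universe element,
# constant time per operation, but space proportional to |U| rather
# than to the dictionary size n.

class BitVectorDict:
    def __init__(self, universe_size):
        self.bits = bytearray(universe_size)  # all zeros initially

    def insert(self, x):
        self.bits[x] = 1

    def delete(self, x):
        self.bits[x] = 0

    def search(self, x):
        return self.bits[x] == 1
```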

In this chapter, we study the amortized complexity of the dictionary problem under

a multilevel hashing model that is based on Yao's cell probe model, and prove that

dictionary operations require log-logarithmic amortized time in this model.

2.1 Introduction

The Dictionary Problem is to maintain a dynamic set, called a dictionary, over a finite,

ordered universe U under the following operations:

• Search(x): Determine whether element x is in the dictionary; if so, find a memory

location storing x.

• Insert(x): Add element x to the dictionary.

• Delete(x): Remove element x from the dictionary.

The dictionary is initially the empty set. We would like to know how fast the problem

can be solved using only space polynomial in the dictionary size (denoted by n ) on a

RAM with (log|U|)-bit memory words. Here we are restricting the space used in terms

of the dictionary size in order to exclude the trivial bit vector solution and a family of

its generalizations, due to Willard [37], that process dictionary operations in constant

time and use space depending on the size of the universe.

The Dictionary Problem has been extensively studied and has a rich variety of solu-

tions.

Balanced search trees solve the problem in O(log n) time per operation and use O(n)

space [1,21,23,32]. Self-adjusting search trees such as Sleator and Tarjan's Splay Tree

[28] and Galperin and Rivest's Scapegoat Tree take O(log n) amortized time per

operation. There is no hope of improving the log n behaviour of search trees since they

use only comparisons to perform searches.

Hashing schemes solve the problem in constant expected time per operation and

use O(n) space. Uniform hashing performs dictionary operations in constant expected

time when the operations are randomly chosen from a uniform distribution [21,1]; the

space used is 0(n). Universal hashing is an improved randomized hashing scheme that

performs searches in constant expected time and performs updates in constant expected

amortized time [8]; the expectation, here, is over the random hash functions chosen

by the algorithm and the bounds apply to all possible operation sequences. Dynamic

perfect hashing [13] (see also [2,14]) is a further improvement that performs searches in

constant worst-case time and performs updates in constant expected amortized time.

All of the above hashing schemes fall under the general category of multilevel hashing

schemes. Roughly speaking, a multilevel hashing scheme consists of a hierarchically

organized system of hash functions that successively partition the dictionary into finer

pieces until all elements in the dictionary have been separated, plus an algorithm to

update the configuration after each dictionary operation.
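As a concrete instance of such a scheme, a two-level (FKS-style) static dictionary can be sketched as follows (hash family, parameters, and function names are illustrative; keys are assumed to be distinct integers below the prime p):

```python
# Two-level hashing: a top-level hash function splits the keys into
# buckets, and each bucket of size s gets its own second-level table
# of size s^2, retried until it is collision-free (perfect).

import random

def _hash_fn(p, m):
    # One member of a simple multiplicative hash family mod prime p.
    a = random.randrange(1, p)
    return lambda x, a=a, m=m: (a * x % p) % m

def build_fks(keys, p=10**9 + 7):
    n = len(keys)
    top = _hash_fn(p, n)
    buckets = [[] for _ in range(n)]
    for x in keys:
        buckets[top(x)].append(x)
    tables = []
    for b in buckets:
        while True:  # retry until the quadratic-size table is perfect
            h = _hash_fn(p, max(1, len(b) ** 2))
            slots = {}
            if all(slots.setdefault(h(x), x) == x for x in b):
                break
        tables.append((h, slots))

    def search(x):
        h, slots = tables[top(x)]
        return slots.get(h(x)) == x

    return search
```

The retry loop terminates quickly in expectation because a random function from a (near-)universal family into s^2 slots is collision-free on s keys with constant probability.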

The fastest deterministic solution to the problem, at present, is a multilevel hashing

scheme, due to Fredman and Willard [17], that takes O(log n / log log n) amortized time

per operation and uses O(n) space.

It has been possible to prove nonconstant lower bounds for the problem on certain de-

terministic hashing models. Dietzfelbinger et al. [13] showed a tight bound of Θ(log n) on

the amortized complexity of dictionary operations under a wordsize-independent multi-

level hashing model that does not include search trees. Mehlhorn et al. [24] strengthened

this model so that it includes search trees and showed that dictionary operations have

Θ(log log n) amortized complexity under their model. These models assume that mem-

ory locations have an unbounded wordsize and overestimate the costs of operations; this

simplifies the task of proving lower bounds but the models are not realistic. If memory

locations have a sufficiently large wordsize, the problem is solvable in constant time per

operation (Paul and Simon [25]; Ajtai et al. [4]). When memory locations can only rep-

resent a single element of the universe, however, the best available solution is Fredman

and Willard's O(log n / log log n)-time solution [17].

We prove a nonconstant lower bound for the Dictionary Problem under a multilevel

hashing model, based on Yao's cell probe model, that takes into account the limited

wordsize of memory locations and realistically measures the costs of operations.

We define a generic multilevel hashing model for solving the Dictionary Problem

from which various lower bound models for the problem can be derived by specifying

suitable cost measures for the operations. The model has a vertically organized memory

that consists of a root location at level 1 and an array of at most m locations at level i,

for each i ≥ 2. Memory locations have b bits each, for some b ≥ log|U|. Each memory

location stores some c different elements of the universe plus a (b − c·log|U|)-bit value

that guides search operations; here the number of elements stored at a location varies

over time, and is not constant. A search operation locates an element in memory by

performing a sequence of memory probes: the first probe of the sequence always reads

the root location and the i-th probe, for i ≥ 2, reads a location at level i that is com-

puted as a general function of the sequence of i − 1 values so far read and the element

being searched. The operation either locates the element in some location after a series

of probes or concludes that the element is not in the dictionary after a series of unsuc-

cessful probes. A search operation finally replaces the current memory configuration by

a new configuration, representing the same dictionary, by computing a general function

of the current configuration. An update operation simply replaces the current memory

configuration by a new configuration that represents the new dictionary by computing a

general function of the current configuration.

We define a lower bound model for the problem by imposing a measure of operation

costs on the generic model. The Hamming cost of an operation is defined as the maximum

number of locations at any level whose contents change due to the operation. The cost

of a search operation is defined as the number of read probes performed by the operation

(called the search cost of the operation) plus the Hamming cost of the operation. The

cost of an update operation is defined as its Hamming cost. When this cost measure

is imposed on the generic model we get a lower bound model called the Hamming cost

model.

We prove our lower bound in the following model that has a different cost measure

from the Hamming cost model. The search path of an element x of U in a memory

configuration C is defined as the sequence of locations probed by a search operation on

X performed in configuration C. We say that an operation refreshes a memory location

l if there is some element x in the final dictionary such that l lies on the search path

of x after the operation but l did not lie on the search path of x before the operation.

The refresh cost of an operation is defined as the maximum number of locations at any

level that get refreshed by the operation. Define the cost of an operation as the sum of

the search cost and the refresh cost of the operation. The lower bound model with this

cost measure is called the refresh cost model. The difference between this model and

the Hamming cost model is that the refresh cost measure is sensitive to locations that

participate in rerouting the search paths of dictionary elements during an operation;

the Hamming cost measure, on the other hand, is sensitive to locations whose contents

change.

A nonconstant lower bound for the Dictionary Problem in the Hamming cost model

would translate into a similar lower bound for the problem in the cell probe model. We

believe that such a lower bound exists, but we can only prove a lower bound under

the refresh cost model. The refresh cost model seems to be incomparable in power to

the cell probe model and to the Hamming cost model. The justification for the refresh

cost model is that, in realistic hashing schemes, an operation might have to read, and,

if necessary, modify, the locations it refreshes in order to correctly reroute the search

paths of dictionary elements. The refresh cost model encompasses many of the known

solutions to the dictionary problem, but not all of them; for instance, the model includes

B-trees, red-black trees, and various hashing schemes, but the class of rotation-based
search trees is not included.

Our lower bound is given by:

Theorem 2.1 Consider the refresh cost multilevel hashing model with at most m mem-

ory locations per level, a universe U, and a wordsize b. Let n be a positive integer

satisfying |U| ≥ m^(log log n). Consider dictionary operation sequences during which the
