

Computer Science Department



TECHNICAL REPORT



Amortized Complexity of Data Structures

Rajamani Sundar
Technical Report 561

May 1991



NEW YORK UNIVERSITY






Amortized Complexity of Data Structures



Rajamani Sundar



October 1991



A dissertation in the Department of Computer Science submitted to the faculty of
the Graduate School of Arts and Science in partial fulfillment of the requirements for
the degree of Doctor of Philosophy at New York University.



Approved




(Signed)

Ravi Boppana
Research Advisor



© Copyright by Rajamani Sundar 1991
All Rights Reserved



Amortized Complexity of Data Structures

Rajamani Sundar
Advisor: Ravi Boppana

Abstract



This thesis investigates the amortized complexity of some fundamental data structure
problems and introduces interesting ideas for proving lower bounds on amortized com-
plexity and for performing amortized analysis. The problems are as follows:

• Dictionary Problem: A dictionary is a dynamic set that supports searches of ele-
ments and changes under insertions and deletions of elements. It is open whether
there exists a dictionary data structure that takes constant amortized time per
operation and uses space polynomial in the dictionary size. We prove that dictio-
nary operations require log-logarithmic amortized time under a multilevel hashing
model that is based on Yao's cell probe model.

• Splay Algorithm's Analysis: Splay is a simple, efficient algorithm for searching
binary search trees, devised by Sleator and Tarjan, that uses rotations to reorganize
the tree. Tarjan conjectured that Splay takes linear time to process deque operation
sequences on a binary tree and proved a special case of this conjecture called the
Scanning Theorem. We prove tight bounds on the maximum numbers of various
types of right rotations in a sequence of right rotations performed on a binary
tree. One of the lower bounds refutes a conjecture of Sleator. We apply the upper
bounds to obtain a nearly linear upper bound for Tarjan's conjecture. We give two
new proofs of the Scanning Theorem, one of which is a potential-based proof that
solves a problem of Tarjan.

• Set Equality Problem: The task of maintaining a dynamic collection of sets under
various operations arises in many applications. We devise a fast data structure
for maintaining sets under equality-tests and under creations of new sets through
insertions and deletions of elements. Equality-tests require constant time and set-
creations require logarithmic amortized time. This improves previous solutions.



Acknowledgements



It is a pleasure to acknowledge the help of many people in performing the work of this
thesis.

I would like to thank my advisor Ravi Boppana for teaching me many useful research
skills and teaching/presentation skills. I learnt from him the importance of simple and
clear formulation of ideas, and about several useful ideas in Theoretical Computer Science
and Discrete Mathematics. The work on the Dictionary Problem owes a great deal to
the various things I learnt from him and to his help. He showed me the initial directions
to proceed along, he came up with fresh approaches to pursue whenever I failed, and he
kept encouraging me to succeed.

I would like to thank my co-advisor Richard Cole for initiating me into the area of
amortized analysis, for always being a source of inspiration, and for always giving me
kind attention and advice when I needed them. The work on the Deque Conjecture owes
greatly to the several fruitful discussions I had with him.

I would like to thank Mike Fredman, Bud Mishra, and Bob Tarjan for serving on my
thesis defense committee and for their role in this work. Mike Fredman and Bob Tarjan
supported and guided me during the summer of 1989 when I was a student under them
at Bellcore and at Bell labs, respectively. They were always sources of nice problems to
work on, and of inspiration and encouragement. The work on the Set Equality Problem
is part of a larger joint work done with Bob Tarjan. Bud Mishra supported me during
the academic year 1988-89, and was always a source of encouragement and help.

I would like to thank the Courant Institute community for providing a great envi-
ronment for this work.

Finally, I would like to thank IBM Corporation for graciously supporting me during
the academic years 1989-91.



Contents



1 Introduction 1

1.1 Amortized Complexity 1

1.2 Our Work 3

2 The Dictionary Problem 5

2.1 Introduction 5

2.2 Single-level Hashing Model 9

2.2.1 Uniform Hash Functions and Worst-case Complexity 11

2.2.2 Nonuniform Hash Functions 12

2.2.3 Amortization 13

2.3 Multilevel Hashing Model 13

2.3.1 Partial Hashing Model 15

2.3.2 Adversary 16

2.3.3 Two Random Sampling Lemmas 21

2.3.4 The Worst-case Lower Bound 27

2.3.5 Amortization 30

3 The Deque Conjecture 35

3.1 Introduction 35

3.1.1 The Splay Algorithm and Its Conjectures 35

3.1.2 Terminology 39




3.1.3 Previous Works 41

3.1.4 Our Results 41

3.2 Counting Twists, Turns, and Cascades 43

3.2.1 Upper Bounds 43

3.2.2 Lower Bounds 51

3.3 An Upper Bound for the Deque Conjecture 62

3.4 New Proofs of the Scanning Theorem 65

3.4.1 A Potential-based Proof 66

3.4.2 An Inductive Proof 68

4 Testing Set Equality 75

4.1 Introduction 75

4.2 The Data Structure 78

4.3 The Analysis 79

4.4 Directions for Further Work 83

Bibliography 85






Chapter 1



Introduction



1.1 Amortized Complexity

In many applications of data structures, the data structure is embedded within some
algorithm that performs successive operations on it. In these applications, we are inter-
ested only in the time taken by the data structure to process operation sequences as a
whole and not in the time spent on isolated operations. Amortized data structures are
data structures tailored to such applications: these data structures may perform poorly
on a few individual operations but perform very well on all operation sequences. The
natural performance measure for an amortized data structure is its amortized complexity,
defined to be the maximum cost of operation sequences performed on the data structure
as a function of the lengths of the sequences. Amortized data structures are appealing
because they dispense with complicated constraints and associated information present
in data structures that achieve a fast performance on all operations and they use simple
reorganizing heuristics, instead, to achieve a fast amortized performance. Some exam-
ples of these data structures are the compressed tree data structures for the Union-find
Problem [1,23,32], the Splay Tree [23,28,32], and the Pairing Heap [16].

Amortized data structures are simple to describe but their performance analysis is
often quite involved. Since operation sequences on these data structures are mixtures
of operations of varying costs that interact finely with one another, it is tricky to
accurately estimate their amortized complexity. Of the three amortized data structures
mentioned above only the first one has been analyzed thoroughly; even its analysis was
accomplished only several years after the data structure was originally conceived. The
complete analysis of the other two is still open.

A useful framework for performing amortized analysis involves defining an appro-
priate potential function for the data structure [34]. In this framework, each state of
the data structure is assigned a real number called its potential. The amortized cost of
a data structure operation is defined to be the actual cost incurred by that operation
plus the increase in potential it causes through change of state. The total cost of an
operation sequence performed on the data structure equals the total amortized cost of
the operations in the sequence plus the decrease in potential from the initial state to the
state at the end of the operation sequence. Choosing a suitable potential function that
yields a sharp estimate on the amortized complexity is a task that demands ingenuity.
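
To make the accounting explicit, write Φ for the potential function, D_{i-1} and D_i for the states just before and after the i-th operation, and t_i for its actual cost; the definitions above then read, in notation chosen here merely for concreteness:

    a_i = t_i + \Phi(D_i) - \Phi(D_{i-1}), \qquad
    \sum_{i=1}^{m} t_i = \sum_{i=1}^{m} a_i + \Phi(D_0) - \Phi(D_m).

In particular, whenever the final potential is at least the initial potential, the total amortized cost upper-bounds the total actual cost of the sequence.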

We illustrate the potential framework through a simple example. A priority queue
with attrition (PQA) is a dynamic set of real numbers supporting two operations:
Deletemin deletes and returns the smallest element of the set; Insert(x) deletes all
elements of the set greater than x and adds x to the set. A PQA can be represented
by a sorted list of its elements. In this representation, Deletemin takes constant time,
and Insert takes time proportional to the number of deleted elements. If we define the
potential of the data structure to be the number of elements it contains, then PQA
operations take constant amortized time, and PQA operation sequences take linear time.
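
The following Python sketch (ours, not the thesis's) implements the PQA just described as a sorted list; the potential is simply the current length of the list.

    from collections import deque

    class PQA:
        """Priority queue with attrition, kept as a list sorted in increasing order."""

        def __init__(self):
            self.items = deque()      # smallest element at the left end

        def deletemin(self):
            # Constant actual time; the potential (list length) drops by one.
            return self.items.popleft() if self.items else None

        def insert(self, x):
            # Remove every element greater than x, then append x. The actual
            # time is proportional to the number of deletions, but the potential
            # falls by exactly that amount, so the amortized cost is constant.
            while self.items and self.items[-1] > x:
                self.items.pop()
            self.items.append(x)

Each element pays one unit into the potential when it is inserted and can be deleted at most once, which is why whole operation sequences run in linear time.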

The notions of amortized complexity and amortized cost are usually used in a much
wider sense than defined above. Amortized complexity is usually a maximizing function
of several parameters of the operation sequences, instead of their length alone. The
amortized costs of operations are usually a set of functions of the operation sequence
parameters that, when added together, yield a good estimate of the amortized complex-
ity. For example, the compressed tree data structures for the Union-find Problem have
amortized complexity equal to O(n + m·α(m + n, n)), and take constant time on Union
operations and O(α(m + n, n)) amortized time on Find operations, where α(m, n) is an
inverse function of the Ackermann function, and m and n respectively denote the total
number of Finds and the total number of Unions in the operation sequence [23,32].
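
For concreteness, here is a standard textbook rendering of the compressed tree structure cited above, with path compression and union by rank. It is not taken from the thesis, and it differs slightly from the accounting in the previous paragraph in that Union locates the two roots itself.

    class UnionFind:
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, x):
            # Path compression: make every node on the search path point at the root.
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            while self.parent[x] != root:
                self.parent[x], x = root, self.parent[x]
            return root

        def union(self, x, y):
            # Union by rank keeps the compressed trees shallow.
            rx, ry = self.find(x), self.find(y)
            if rx == ry:
                return
            if self.rank[rx] < self.rank[ry]:
                rx, ry = ry, rx
            self.parent[ry] = rx
            if self.rank[rx] == self.rank[ry]:
                self.rank[rx] += 1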

Let us relate amortized complexity to other measures of data structure performance.
There exist data structure problems whose amortized complexities are lower than their
worst-case complexities; for instance, the Union-find Problem is solvable in nearly con-
stant amortized time per operation but requires nearly logarithmic worst-case time per
operation [7,15,32]. There probably exist problems whose randomized (or average-case)
complexities are lower than their amortized complexities, but this is yet to be proven; for
instance, the Dictionary Problem and certain other problems that involve testing equal-
ity of objects appear to be good candidates for this class. The amortized complexity of
a dynamic data structure problem is often intimately related to the complexity of its
static version via transformations of solutions/adversaries of one problem into another
[6,39].

An appropriate model for proving lower bounds on the amortized complexity of a
data structure problem is Yao's cell probe model [38]. This model abstracts out the
arithmetic and indexing capabilities of random access machines without ignoring their
word-size limitations. The model has a memory consisting of an array of b-bit locations.
Data structure operations are performed in a series of memory probes in which the next
probe location is always computed as a general function of the values so far read. In this
model, Fredman and Saks [15] proved tight lower bounds on the amortized complexity
of many problems, including the Union-find Problem. The only other interesting lower
bound known in this model is for a static data structure problem, due to Ajtai [3]. The
complexity of many other problems, notably, the Dictionary Problem and the Priority
Queue Problem, remains unexplored.

This completes our introduction to amortized complexity. Further information on
this subject can be found in [23,32,34].

1.2 Our Work

This thesis investigates the amortized complexity of some fundamental data structure
problems. We introduce interesting ideas for proving lower bounds on amortized com-
plexity and for performing amortized analysis that enable progress towards resolving
some open questions. The problems studied are as follows.

In Chapter 2, we study the amortized complexity of the Dictionary Problem. A dic-
tionary is a dynamic set that supports searches, insertions, and deletions of elements. It
is an open problem whether a dictionary can be maintained in constant amortized time
per operation using space polynomial in the dictionary size; we denote the dictionary
size by n. While hashing schemes solve the problem in constant amortized time per oper-
ation on the average or using randomness, the best deterministic solution (which uses
hashing) takes O(log n / log log n) amortized time per operation. We study the deterministic
complexity of the dictionary problem under a multilevel hashing model that is based on
Yao's cell probe model, and prove that dictionary operations require Ω(log log n) amor-
tized time in this model. Our model encompasses many known solutions to the dictionary
problem, and our result is the first nontrivial lower bound for the problem in a reason-
ably general model that takes into account the limited wordsize of memory locations
and realistically measures the cost of update operations. This lower bound separates the
deterministic and randomized complexities of the problem under this model.

In Chapter 3, we study a problem arising in the analysis of Splay, a rotation-based al-
gorithm for searching binary search trees that was devised by Sleator and Tarjan. Tarjan
proved that Splay takes linear time to scan the nodes of a binary tree in symmetric order;
this result is called the Scanning Theorem. More generally, he conjectured that Splay
takes linear time to process deque operation sequences on a binary tree; this conjecture is
called the Deque Conjecture. We prove that Splay takes linear-times-inverse-Ackermann
time to process deque operation sequences. In the process, we obtain tight bounds on
some interesting combinatorial problems involving rotation sequences on binary trees.
These problems arose from studying a conjecture of Sleator that we refute. We give two
new proofs of the Scanning Theorem. One proof is a potential-based proof; Tarjan had
originally posed the problem of finding such a proof. The other proof uses induction.

In Chapter 4, we study the problem of maintaining a dynamic collection of sets under
equality-tests of two sets and under creations of new sets through insertions and deletions
of elements. We devise a data structure that takes constant time on equality-tests and
logarithmic amortized time on set-creations. The data structure derandomizes a previous
randomized data structure, due to Pugh and Teitelbaum, that took logarithmic expected
amortized time on set-creations.

Some of our work has been published before. The work on the Deque Conjecture was
reported in [30]. The work on the Set Equality Problem was reported in a joint paper
with Robert E. Tarjan [31] that also dealt with other related problems.



Chapter 2

The Dictionary Problem



The Dictionary Problem is a classical data structure problem arising in many applications
such as symbol table management in compilers and language processors. The problem
is to maintain a dynamic set under the operations of search, insertion, and deletion of
elements. The problem can be solved in constant time per operation using a bit vector
that has a separate bit for each element of the universe indicating its presence in the
set. This simple solution is unsatisfactory since it uses space proportional to the universe
size and the dictionary is usually very small compared to the universe. Does there exist a
constant-amortized-time solution that uses only space polynomial in the dictionary size?
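
The bit-vector solution just described fits in a few lines; the sketch below (ours) assumes the universe is the set of integers 0, ..., U_SIZE − 1, with U_SIZE an illustrative constant.

    U_SIZE = 1 << 20                  # illustrative universe size

    class BitVectorDictionary:
        """Constant time per operation, but Theta(|U|) space."""

        def __init__(self):
            self.present = bytearray(U_SIZE)   # one presence flag per universe element

        def search(self, x):
            return self.present[x] == 1

        def insert(self, x):
            self.present[x] = 1

        def delete(self, x):
            self.present[x] = 0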

In this chapter, we study the amortized complexity of the dictionary problem under
a multilevel hashing model that is based on Yao's cell probe model, and prove that
dictionary operations require log-logarithmic amortized time in this model.

2.1 Introduction

The Dictionary Problem is to maintain a dynamic set, called a dictionary, over a finite,
ordered universe U under the following operations:

• Search(x): Determine whether element x is in the dictionary; if so, find a memory
location storing x.

• Insert(x): Add element x to the dictionary.

• Delete(x): Remove element x from the dictionary.

The dictionary is initially the empty set. We would like to know how fast the problem
can be solved using only space polynomial in the dictionary size (denoted by n) on a
RAM with (log|U|)-bit memory words. Here we are restricting the space used in terms
of the dictionary size in order to exclude the trivial bit vector solution and a family of
its generalizations, due to Willard [37], that process dictionary operations in constant
time and use space depending on the size of the universe.

The Dictionary Problem has been extensively studied and has a rich variety of solu-
tions.

Balanced search trees solve the problem in O(log n) time per operation and use O(n)
space [1,21,23,32]. Self-adjusting search trees such as Sleator and Tarjan's Splay Tree
[28] and Galperin and Rivest's Scapegoat Tree take O(log n) amortized time per
operation. There is no hope of improving the log n behaviour of search trees since they
use only comparisons to perform searches.

Hashing schemes solve the problem in constant expected time per operation and
use O(n) space. Uniform hashing performs dictionary operations in constant expected
time when the operations are randomly chosen from a uniform distribution [21,1]; the
space used is O(n). Universal hashing is an improved randomized hashing scheme that
performs searches in constant expected time and performs updates in constant expected
amortized time [8]; the expectation, here, is over the random hash functions chosen
by the algorithm and the bounds apply to all possible operation sequences. Dynamic
perfect hashing [13] (see also [2,14]) is a further improvement that performs searches in
constant worst-case time and performs updates in constant expected amortized time.
AH of the above hashing schemes fall under the general category of multilevel hashing
schemes. Roughly speaking, a multilevel hashing scheme consists of a hierarchically
organized system of hash functions that successively partition the dictionary into finer
pieces until all elements in the dictionary have been separated, plus an algorithm to
update the configuration after each dictionary operation.
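
As a concrete point of reference, the sketch below gives one standard universal family of the form h(x) = ((a·x + b) mod p) mod m. It is meant only to illustrate the flavour of the randomized schemes cited above, not to reproduce the construction of [8] exactly; the prime and parameters are illustrative choices.

    import random

    def make_universal_hash(m, universe_size):
        # p is a prime exceeding the universe size (2^61 - 1 is used here for illustration).
        p = 2305843009213693951
        assert universe_size <= p
        a = random.randrange(1, p)    # random multiplier, fixed per hash table
        b = random.randrange(0, p)    # random offset, fixed per hash table
        return lambda x: ((a * x + b) % p) % m

    # Usage: hash into a table of 1024 buckets; over the random choice of (a, b),
    # two distinct elements collide with probability about 1/m, which is what
    # yields constant expected search time.
    h = make_universal_hash(m=1024, universe_size=1 << 32)
    bucket = h(123456789)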

The fastest deterministic solution to the problem, at present, is a multilevel hashing
scheme, due to Fredman and Willard [17], that takes O(log n / log log n) amortized time
per operation and uses O(n) space.

It has been possible to prove nonconstant lower bounds for the problem on certain de-
terministic hashing models. Dietzfelbinger et al. [13] showed a tight bound of Θ(log n) on
the amortized complexity of dictionary operations under a wordsize-independent multi-
level hashing model that does not include search trees. Mehlhorn et al. [24] strengthened
this model so that it includes search trees and showed that dictionary operations have
Θ(log log n) amortized complexity under their model. These models assume that mem-
ory locations have an unbounded wordsize and overestimate the costs of operations; this
simplifies the task of proving lower bounds but the models are not realistic. If memory
locations have a sufficiently large wordsize, the problem is solvable in constant time per
operation (Paul and Simon [25]; Ajtai et al. [4]). When memory locations can only rep-
resent a single element of the universe, however, the best available solution is Fredman
and Willard's O(log n / log log n)-time solution [17].

We prove a nonconstant lower bound for the Dictionary Problem under a multilevel
hashing model, based on Yao's cell probe model, that takes into account the limited
wordsize of memory locations and realistically measures the costs of operations.

We define a generic multilevel hashing model for solving the Dictionary Problem
from which various lower bound models for the problem can be derived by specifying
suitable cost measures for the operations. The model has a vertically organized memory
that consists of a root location at level 1 and an array of at most m locations at level i,
for each i ≥ 2. Memory locations have b bits each, for some b ≥ log|U|. Each memory
location stores some c different elements of the universe plus a (b − c·log|U|)-bit value
that guides search operations; here the number of elements stored at a location varies
over time, and is not constant. A search operation locates an element in memory by
performing a sequence of memory probes: the first probe of the sequence always reads
the root location and the i-th probe, for i ≥ 2, reads a location at level i that is com-
puted as a general function of the sequence of i − 1 values so far read and the element
being searched. The operation either locates the element in some location after a series
of probes or concludes that the element is not in the dictionary after a series of unsuc-
cessful probes. A search operation finally replaces the current memory configuration by
a new configuration, representing the same dictionary, by computing a general function
of the current configuration. An update operation simply replaces the current memory
configuration by a new configuration that represents the new dictionary by computing a
general function of the current configuration.
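
The following schematic (ours; the thesis defines the model purely mathematically) shows the shape of a search in this generic model. The helpers probe_index and found_at stand in for the model's unrestricted "general functions" and are hypothetical placeholders, as is the fixed bound max_levels.

    def search(memory, x, probe_index, found_at, max_levels):
        # memory[level] is the array of b-bit locations at that level;
        # level 1 holds just the root location.
        values_read = []
        for level in range(1, max_levels + 1):
            idx = 0 if level == 1 else probe_index(x, values_read)
            value = memory[level][idx]
            values_read.append(value)
            if found_at(x, value):
                return (level, idx)    # x located in this cell
        return None                    # x is not in the dictionary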




We define a lower bound model for the problem by imposing a measure of operation
costs on the generic model. The Hamming cost of an operation is defined as the maximum
number of locations at any level whose contents change due to the operation. The cost
of a search operation is defined as the number of read probes performed by the operation
(called the search cost of the operation) plus the Hamming cost of the operation. The
cost of an update operation is defined as its Hamming cost. When this cost measure
is imposed on the generic model we get a lower bound model called the Hamming cost
model.

We prove our lower bound in the following model that has a different cost measure
from the Hamming cost model. The search path of an element x of U in a memory
configuration C is defined as the sequence of locations probed by a search operation on
x performed in configuration C. We say that an operation refreshes a memory location
l if there is some element x in the final dictionary such that l lies on the search path
of x after the operation but l did not lie on the search path of x before the operation.
The refresh cost of an operation is defined as the maximum number of locations at any
level that get refreshed by the operation. Define the cost of an operation as the sum of
the search cost and the refresh cost of the operation. The lower bound model with this
cost measure is called the refresh cost model. The difference between this model and
the Hamming cost model is that the refresh cost measure is sensitive to locations that
participate in rerouting the search paths of dictionary elements during an operation;
the Hamming cost measure, on the other hand, is sensitive to locations whose contents
change.

A nonconstant lower bound for the Dictionary Problem in the Hamming cost model
would translate into a similar lower bound for the problem in the cell probe model. We
believe that such a lower bound exists, but we can only prove a lower bound under
the refresh cost model. The refresh cost model seems to be incomparable in power to
the cell probe model and to the Hamming cost model. The justification for the refresh
cost model is that, in realistic hashing schemes, an operation might have to read, and,
if necessary, modify, the locations it refreshes in order to correctly reroute the search
paths of dictionary elements. The refresh cost model encompasses many of the known
solutions to the dictionary problem, but not all of them; for instance, the model includes
B-trees, red-black trees, and various hashing schemes, but the class of rotation-based
search trees is not included.

Our lower bound is given by:

Theorem 2.1 Consider the refresh cost multilevel hashing model with at most m mem-
ory locations per level, a universe U, and a wordsize b. Let n be a positive integer
satisfying |U| ≥ m^(…). Consider dictionary operation sequences during which the

