David Wood.

Parallel queues and pools, an evaluation online

. (page 1 of 4)
Online LibraryDavid WoodParallel queues and pools, an evaluation → online text (page 1 of 4)
Font size
QR-code for this ebook

Parallel Queues and Pools, An Evaluation

David Wood

Ultracomputer Note #150
January, 1989

This work was supported under National Science Foundation grant No. DCR-84 13359.
* The initial version of this paper was submitted as my Master's Thesis


Introduction 1

The NYU Ultra Computer 1

What is a Parallel Queue? 1

Some Synchronization Routines 2

Queue/ Pool Overview 4

Singleton Queues

Llist 5

SUist 5

Queue 6

Hqueue/Dhqueue Commonalities 6

Hqueue 7

Dhqueue 8

Fqueue 10

Vqueue 10


Mqueue 11

Tqueue 12


Pool 14

Set 14

Cset 15

Cqtst 17

Multiple Queue Testing 17

Variable Queue Size 18

Multiple Copy Insertion 18

Interior Removal 18

Variable Number of Spawned Processes 18

Queue "Exercising" 18

Timing 19

Output 19

Fifb 22

Testing and Results 23

Data Plots 24

Discussion 34

Singleton Queues 34

PE Contention 35

Multi-Queues 35

Pools 36

Tqueue Performance vs. K-ariness of Tree 36

Conclusions 37

Future Work 37

Appendix 38

Bibliography 38



In this paper I examine a fairly broad range of parallel queue and pool (unordered queue)
algorithms and present data indicating their relative strengths and weaknesses. The algorithms
have been implemented on the NYU Ultracomputer prototype, an eight processor MIMD, shared
memory single bus machine. The data have been normalized to remove the effects of bus conten-
tion. I discuss the implementation of a few of the basic synchronization primitives and the imple-
mentation of each algorithm to be tested. Two programs were written, one rqtst provided a gen-
eralized arena for timing the different algorithms under different circumstances. The second pro-
gram, fifo, was used to test the parallel fifo-ness of the queues, that is, to test whether the given
algorithm truly represented a queue.

The NYU Ultracomputer

The Ultracomputer is an eight processor, multiple instruction - multiple data (MIMD),
shared memory machine. Each processing element (PE) is connected to SMB of shared memory
through a single 32-bit bus. Each PE consists of a single Motorola 68010 microprocessor with 32
KB of private cache memory. The PEs use software long integer multiply and divides. A PDP
1 1/34 is used to perform I/O for the Ultracomputer.

The basic primitive fetch-and-and, to be discussed shortly, is supported in hardware at the
shared memory end of the bus. All other fetch-and-phi operations (store and or) are implemented
in software through the use of fetch-and-add.

Dynamic shared memory allocation by spawned processes is not supported in the current
system. In the algorithms that required this, I simulated it by allocating enough memory before
spawning. This means of course that the totd memory size was larger than it had to be in these

For more information on the Ultracomputer I refer you to references |7| and [8). Reference
[7] in particular provides a very nice overview of software and hardware considerations concerning
the Ultracomputer.

What is a Parallel Queue?

Queues and pools are basic elements in any multi-programming system and are used in
many applications. They comprise an important and heavily used data structure in operating sys-
tems and so should be relatively light weight (in memory) and as fast as possible. There is, of
course, a trade off between memory and speed. In a serial environment, speed is measured strictly
by the number of machine instructions executed on each insert and delete (this discussion a.ssumes
the data structure is neither full nor empty). In a parallel environment, one must consider the
added time due to synchronization. For example, a linked list may be a fast serial algorithm, but
when synchronization is added to accommodate parallelism, linked lists become a relatively poor
queue. This is because each of the processes must serialize to gain access to the list. We will say
that an algorithm is highly parallel if the time it takes n processes to in.sert in parallel, is the same
within a constant, to the time it takes one process to insert serially.

The term queue implies an ordering to the items in the queue corresponding to the order in
which each was inserted. What does it mean to be ordered when many processes are inserting

- 1-

items onto the queue concurrently? To discuss parallel queues then, we must first define what it
means to be ordered. Gottlieb, Lubachevsky and Rudolf (GLR) in [51, defined the characteristics
of a parallel queue as follows,

// insertion of data item p is completed before iruertion of another data item q is
started, then it must not be possible for a deletion yielding q to complete before a
deletion yielding p has started.

This implies that once an item is on (allocated a position in) the queue, it must be assigned to a
delete process before any later items in the queue are assigned a delete process. This does allow
for later items in the queue to be returned earlier than previous items on the queue, however. Due
to scheduling, the process assigned to delete q may actually complete before the process assigned
to delete p.

Some Synchronization Routines

GLR discuss the various synchronization primitives used in parallel algorithms, the most
basic of which is the atomic operation fetch-and-add, also referred to as/aa. The result oi faa(i,a)
is to increment / by a, and return the original value of /. This primitive is used to implement the
test-modify-retest paradigms tdr (test decrement retest) and tir (test increment retest), in which a
variable is tested to see if it will exceed some bound when either decremented or incremented,
respectively. A bound of zero is assumed for tdr. Their implementations are as follows,

boolean function tir(s,delta,bound)
tir = false
if (s + delta < - bound) then

if (faa(s,delta) < = bound)

tir = true

end tir

boolean function tdr( - 0) then

if (faa(s,delta) > " 0) then

tdr " true

end tdr

Race conditions and starvation, which are present in this implementation, arc discussed in GLR.
Tdr and tir will be used extensively to see if the queues/pools are empty or full, respectively, and
can therefore be deleted from or inserted on to.

The generalized counting semaphores P and V are used, most importantly, to obtain
exclusive access to a data structure, such as the linked lists above. F.ach semaphore is imple-
mented as an integer that is initialized to a value that depends on the algorithm. The following are
the implementations of P and V,

procedure P(semaphare,dclta)

repeal until tdr(semaphore,delta)
end P

procedure V(semaphore, delta)

end V

In the linked list example, the integer would be given the initial value one. The list would be
obtained using P(semaphore,\) and released with V{semaphore,\).

- 2-


The remaining three synchronization primitives are more involved and will only be discussed
to describe their semantics.

TTie first is the group lock synchronization algorithm developed by Dimitrovsky in [6]. This
following set of routines

glock(lock)/gunlock(lock) - enter or leave a group

gsync(lock) - synchronize the group

provides for the arbitrary grouping of processes at a boundary {glock or gsync). Clock acts like a
waiting room in which processes wait (busy-wait) for a door to open at which point they all go
through the door. TTie door will not open, until the previous group of processes has exited
through gunlock. Gsynch is used between the glock and gunlock calls to resynchronize the grouped

The second is something called a reader-writer lock. In this case there is a lock associated
with a data structure that can be read by any number of processes when no process is writing (or
waiting to write) to the structure. At any time any reader may become a writer, at which time all
requests for read locks are denied access and forced to busy-wait until the writer has released its
lock. The new writer may not proceed until all current readers have released their locks, and there
may only exist one writer at any given time. A naive use of a reader-writer lock in queues would
be to avoid integer overflow resulting from faa on position counters in the queue. When the
counter reaches some critical value, a reader lock is upgraded to a writer lock, which waits for all
current readers to finish, and then modifies the counter. Note that this needn't be done with a
faa since ail other modifiers of the counter have been locked out.

The set of required reader-writer routines is as follows.

rw_initlock(lock) - initialize a reader-writer lock

rw_rlock(lock)/rw_runlock(lock) - obtain/release a reader-writer read lock

rw_wlock(lock)/rw_wunlock(lock) - obtain/release a reader-writer write lock

read2write(Iock) - change from a read lock to a write lock

write2read(lock) - change from a write lock to a read lock

Their implementations can be found in GLR, [10|, |1| and [2|.

The last is the reader-reader lock in which there arc two groups of processes A and B, that
can not access a data structure at the same time. That is, the data structure will not be accessed
by processes from both groups concurrently, either group A or group B can access, but not both.
Arbitrarily one group is given priority, which can lead to starvation. A reader-reader lock is useful
in dynamic hash queues in which the size of the hash table is allowed to shrink and expand. One
would not want both an expand and a shrink taking place simultaneously, so a reader-reader lock
is used in which group A are the expanders and group B the shrinkers.

The set of required reader-reader routines is as follows.

rr_initlock(lock) - initialize a reader-reader lock

rr_Alock(lock)/rr_Aunlock(lock) - obtain/release a reader-reader A lock

rr_Block(lock)/rr_Bunlock(lock) - obtain/release a reader-reader B lock

Their implementations can be found in (91, |1| and 12].

- 3-

Queue/Pool Overview

TTie next few sections describe the algorithms that were tested. I treat some of the older ones
less thoroughly and go into more depth with the newer ones. There are three classes of algo-
rithms. The singleton queues are FIFO queues that do not support multiple item insertion, while
the second type, multi-queues, do support this feature. The third type, which is not a queue but
rather a grouping of items, I have been referring to as a pool. All algorithms use busy-waiting
synchronization primitives as opposed to the non-busy-waiting (or blocking) type.

I define the following variables in order to quantify the various algorithms' performance.

m - is the number of queues,

M - is the maximum number of queues that can be supported.

n - is the number of items that may appear on all m queues at one time (not counting

A' - is the maximum number of items that can be supported on one queue.
p - is the amount of parallelism to be supported.

I have used these quantities to define each algorithms' memory and time requirements in Table 1
below. Where a quantity is not defmed above, see the description of the algorithm.












0(m + n)
0(px m-l-n)
0(m + n + hsize)
0(m + n -1- hsize(n))
0(m X n)
0(m X n)




0(2 + 2 X MaxLF)





0( 1 + n/hsize)

0(2 + 3xMinLF/2)




0(p xm + n)
0(n+m X (log^n+ 1))


O(l + log^n)


0(1 +log^n)




0(m X n)

0(m X n/wordsize)

0(p xm + n)



/ hsizefl li Che fize of the hash table.

2 Tfme comptexities ihown are wont cast. n/MwLF s hstze i n/MlnLF

3 Ptrformanct dtpwnds upon th§ multiplicity (C) of tha ittmi. Whwn p < C, Um» depmdtmcy opproeehin 0(1 }, but in tht worst cost whtrt c " /, th* lim* depen-
dmey is 0(p} on optrogt. This is because tbt proctss&s must s*riehjt to obtain sucetssivt thmmts

4 Tht numbtr of ehildrtn of toeh nod* in tht trtt is vanoblt. For tht standard Itsts It - rt.

Table 1.

The time complexities shown in the table represent best case times, with exception to those for
dhqueue. I have assumed that queue/pool is not full when inserting or empty when deleting, and
no process has to wait for an item to be inserted or deleted (i.e. assume no cell contention).


Each queue/pool maintains a qsize counter to indicate the number of items currently on the
queue. On insertion, either tir{qsize,nitems,maxqsize) or faa{qsize,nitem.^) is performed. The lir is
used when the queue is of fixed size, and if it fails insertion fails. Except in the case of multi-
queues, the value of nitems will be one. On deletion, each queue or pool performs a ldr{qsize,\)
to see if there are any items available. If the tdr fails, deletion fails.

Each algorithm uses head and tail counters as indices into the queue/pool to delete from or
insert on to, respectively. Eventually, these counters will overflow if some action is not taken to
prevent this. In some cases (the hash queues for example), overflow is not a problem. Overflow
will only cause a discontinuity in the indices; some queues/pools are more sensitive to this discon-
tinuity than others and therefore may require an adjustment of head and tail as the data structures
age. In the algorithms that can't handle overflow, in general, a check is made on each/aa of head
or tail to see if the value has reached some predetermined bound. If so, the counter is decre-
mented by bound with a faa.

All of the queues/pools, with exception to the fixed sized singelton queues/pools, provide
the ability to prematurely remove an item from the data structure regardless of its position there
within. In all cases this involves leaving what is called a "hole", a temporary place holder that
when encountered wiU be deleted from the structure but will not be returned as a deleted item.
Rather, deletion continues through the data structure until the first real item is found. It is neces-
sary to leave this hole in most cases because each item, once it is put on the queue is assumed to
be there and associated with a given position in the queue/pool. A place holder must therefore be
left in place of removed items. I have not attempted to test the various methods used for remo-
val, but I mention this aspect in order to be complete.

The pseudo-code presented below is meant to give the reader the basic concept in the algo-
rithm. I do not show the full details, because as one might expect the code could become less
readable and therefore less useful. As a result, and possibly among other things, I have left out
overflow handling and how to allow for interior removal. Except in the case of mqueue, for
which the extra code is included, the codes should not require added synchronization to support
interior removal. The code presented here is a good representation of that which was tested.

Singleton Queues


As mentioned earlier, a single linked list is not too useful a queuing algorithm in a parallel
environment. Distributed link lists are the basis of the algorithms of queue, hqueue, dhqueue and
mqueue, so I will mention its implementation here. It will also be useful as a reference when
comparing other algorithms' performance.

Through the use of PjV, a binary lock (semaphore) is held over the list so that only one
process can access the items from the list at once. This serializes all processes accessing the list
and for this reason will not support highly parallel programming.


The serial linked list was added to cqtst strictly as a reference. The code is identical to that
of Hist except that the synchronization primitives have been removed. This will provide us with a
feel for the relative speeds of the other algorithms, and possibly an idea of how much synchroni-
zation costs.



Queue is an implementation of a GLR-type queue of unlimited size supporting interior
removal. A circular array of linked lists is used to hold the queue items. The array size is user
definable, but should be chosen to be larger than the maximum amount of parallelism. A faa is
performed on a counter, head (for delete) or tail (for insert), that is used to index into the array of
lists. A binary lock is then obtained on the desired list and the item is either deleted or inserted.
Since items are placed at the end of each list, and remembering the circular nature of the array,
this queue Ccin be thought of as a spiral. Items are inserted at the outside and eventually spiral
down to the center where they are deleted. The pseudo-code for queue follows.


Initialize(q ,paralJelism)



q.cnt : =

mytail : = faa(q.lail,l)

if (not tdr(q.cnt,l))

q.tail :=

mylist := mytail mod q.size

relurn(qucue empty)

q.head : =

list:= (address o0q.lists|mylisl]

myhcad : = raa(qhead,l)

q.size := parallelism


mylisl : = myhead mod q.size

for i : = to q.size- 1 do begin

until (list- > pulticket = mytail)

list : = (address or)q.lists(mylist]

q.ljsts(i].cnt :=



q,lists(i).putticket : = i

- put item at end of list

until (list- ^ getticket = myhead)

q.listslij.gettickel := i




faa(list- > putticket.q.size)

until (ldr(lisl-^cnt,l))





- get Item from head of list
raa(lisl- > gellicket.q.size)

The time complexity of this queue is 0(1) for both deletion and insertion. It takes constant
time to fmd the correct list, and with the correct pointers to the beginning and end of each list, list
manipulation is also constant time. The memory complexity is 0(m*p + n). It is likely that
within an operating system m, the number of queues will rise at least linearly with, p the amount
of available parallelism. This means that the memory requirements grow quadratically with p.
This is an important resuh and should be paid close attention to, as some of the other queue
algorithms have more amenable memory requirements.

Hqueue/Dhqueue Commonalities

Hqueue and dhqueue are two hash queue algorithms that are very similar to that of queue
except that they can support many queues on one hash table (circular array). Like queue these
allow for an unlimited size queue and interior removal. The hash queues differ from queue in that
each item on each queue is tagged with a position in the given queue. This avoids the need to
keep the lists ordered, but requires deletes to search the list for the correct item. Dhqueue differs
from hqueue in that it utilizes a dynamic hash table that can both expand and contract. Because
these are new algorithms {hqueue is attributed to Isaac Dimitrovski in reference [1]), I will discuss
them in more detail than the previous algorithms.

Each queue in a hash table is given a unique identifier mapped from the range 1..A/-I- 1,
where M is the maximum number of queues that may use the table at once. The extra value is
used to identify an internal queue of independent queue items that may be exchanged for a
removed item. These queue identifiers are used in the hash function to distinguish similar items
in different queues and to keep queues from tracking one another. Tracking occurs if the rth ele-
ment of queue ql and the j element of queue q2, fall in the same bucket, and the (/+ l)th and

(/■+ l)th elements of queues ql and q2, respectively, also fail in the same bucket,
hash function used is as follows,

The generalized

hashval = (qid * pos) mod hsize,

where pos is obtained from a faa on either head or tail, accordingly. The hash function for
dhqueue, to be discussed shortly, is slightly different but u.ses a similar concept to avoid tracking.

Along with the notion of the queue identifiers (ids) comes the need to keep track of the
available ids. In other algorithms, to reuse a queue one can just re-initialize it, but with these two
algorithms the queue ids must be made available again through a separate function called dispose.
Disposing of a queue mainly involves putting the id back in some kind of set that is associated
with the hash table. The current algorithm uses the cset algorithm, which will be discussed
shortly. Queue initialization then, involves labeling the queue with an id obtained from the set.

Interior removal is handled in a unique way. A separate internal queue is maintained whose
items are used to pass back as copies of the removed items. Upon queue initialization the user
indicates the number of removals (holes) that the queue will support. Only one internal queue is
used to support the removals of all the queues however, and so as a result a queue may get more
or less removals than were requested at initialization. This scheme is a little more fair than the
one used in queue, which requires a doubly-linked list and a counter in each item indicating the
number of times an item was removed immediately after it in its linked list (this was not shown in
the pseudo-code for queue). One pays for this even if removes aren't required, whereas in the
{d)hqueue implementation, one only suffers memory and time penalties when actually using this


Predominantly, the algorithm discussed above is hqueue. There are a few differences
between it and dhqueue that were not mentioned, though. A performance critical difference is the
fact that an hqueue deletion search for the desired item in the bucket outside of the critical Pj V
section. This is in contrast to that of dhqueue and should provide for a better efficiency. Hash
table initialization requires the table's size (i.e. fixed size table). Since the hash function involves
mod'ing the table size, table size should be given as prime. The pseudo-code is shown below.



Insert (q, item)


q.cnt : =

q.tail :=

q.head : =

q.id : = csel_delete(cset);

pos := faa(q.tail,1)

mylist := hash(q,pos)

list := (address of)q.lists|mylist|

- mark item with q.id and pos


if (not tdr(q.cnt,l))

rcturn(queue empty)
pos : = faa(q.head,l)
mylist : = ha.':h(q,pos)
list : = (address oOq.lists|mylist|

- put item on list


- search list for item with
(i.qid = q.id and i.pos = pos)

until (item is found)

- get item from list

Memory requirements are 0(/u/ze+ A/+ ri), where hsize is the size of the hash table, and n is
the number of items on at most M queues using the hash table. Notice that the complexity is
linear in the number of queues and/or processors, assuming M'=c*p. In.sertion is 0(1) time, but


deletion becomes 0( I + n/hsize) since each operation requires searching a list for an item from the
designated queue.


Dhqueue, as mentioned above, utilizes a hash table that can grow and shrink as the load on
the table changes. The algorithm used is a merging of the Dimitrovsky hash queue above and a
serial dynamic hash table algorithm discussed in reference [4). The serial dynamic algorithm is
based on a linear expansion method, in which each bucket in the table is divided or merged once
per cycle. A cycle is represented by the amount of splits/merges required to double/halve the
table size. Each bucket in the table is split/merged once per cycle. A counter p, that indicates the
next bucket to be split/merged, is initialized to zero and increment/decremented with each
split/merge. When a bucket is split/merged, a new/old bucket with index = p+maxp is used as a
buddy. Maxp is the current maximum value that p can obtain before it is reset to zero and the
table begins its next phase of doubling. When p is decremented to zero, divide maxp by two and
set p to the resulting value. When p is incremented to maxp, p is reset to zero and the value of
maxp is doubled. A compile time constant, a power of 2, is chosen as a minimum table size, so
that the value of maxp is always a power of 2.

Memory, a group of buckets designated as a segment, is allocated in pieces that double in
size, up to some maximum, with each allocation. This was not the only strategy available, but it
seems reasonable to always allocate space in proportion to the current size of the table.

The hash function is similar to that of hqueue, using the queue identifier to eliminate track-
ing, but is also dependent on the values of p and maxp as is shown in the pseudo-code in the
table below. If the first value of hashval is less than p, the corresponding bucket has been split. If

1 3 4

Online LibraryDavid WoodParallel queues and pools, an evaluation → online text (page 1 of 4)