
communication load on many lines, etc.

A natural question to ask is about polynomial-time lucid-box reducibilities; in particular, do they affect the theory of

NP-completeness? Unlike the design of efficient algorithms, where new

opportunities are opened, we show in the remainder of this section that

the answer to this question is essentially negative. The argument, as

we shall see, is simple. It is based on the following observation.

Given input variables and a program that operates on them, it is an arbitrary (semantical) decision which program variables are declared to be outputs.
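To make the observation concrete, here is a small sketch (our illustration, not from the paper): the same program computes both a gcd and the full trace of its intermediate computations, and it is a separate, arbitrary decision whether only the gcd, or the trace as well, is declared an output.

```python
def euclid(a, b):
    """Euclid's algorithm, keeping its intermediate computations."""
    steps = []                  # the execution trace
    while b:
        steps.append((a, b))
        a, b = b, a % b
    return a, steps

# Declaring only `a` an output treats euclid as a black box; declaring
# `steps` an output as well exposes the execution, as a lucid box does.
gcd, trace = euclid(48, 18)
assert gcd == 6
assert trace == [(48, 18), (18, 12), (12, 6)]
```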

Extending the notion of lucid-box reducibility to Turing machines

and the related definitions of NP-completeness is straightforward and

therefore omitted. In the introduction we presented the definition of


[AHU] for NP-completeness. Obviously, every problem which is

NP-complete using Cook's reducibility is NP-complete using [AHU]'s

reducibility. [Ev] implies that it is an open question whether the other direction holds as well. The following sentence relates to the definition of NP-completeness of [AHU] given above. Note that both the program for recognizing L_0 and its execution are transparent to us in a lucid-box reducibility from the problem of recognizing L to the problem of recognizing L_0. This seems to imply that our definition of NP-completeness is equivalent to that of [AHU], since nothing can stop us from using the program for recognizing L_0 in any "effective" way.

We would like to elaborate a little on the terminology being used. [GJ] defines a search problem π to consist of a set D_π of finite objects called instances and, for each instance I ∈ D_π, a set S_π(I) of finite objects called solutions for I. An algorithm is said to solve a search problem π if, given as input any instance I ∈ D_π, it returns the answer "no" whenever S_π(I) is empty and otherwise returns some solution s belonging to S_π(I). Define, alternatively, a procedural solution to π to consist of an algorithm that solves π, including all its intermediate computations. Assume that instead of algorithms that

solve search or decision problems we were interested in procedural

solutions of these problems. By definition, any call to a procedural

solution for some problem results in a listing of all its intermediate

computations throughout the T time units during which the procedural

solution ran (for some T). Right after the example in the

introduction, we remarked that a coroutine that runs in time O(T) can be simulated by a subroutine that runs in time O(T²) by restarting it instead of resuming its operation from the place where it was stopped.
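That remark can be sketched in code (our illustration, with Python generators standing in for coroutines): retrieving all T intermediate values from a resumable coroutine costs O(T) in total, while a subroutine that must be restarted from scratch for each value costs O(T²).

```python
def coroutine(n):
    """Yields one intermediate value per step; O(n) over all resumptions."""
    total = 0
    for i in range(n):
        total += i
        yield total           # resumes from here on the next request

def subroutine(n, step):
    """Restarted from the beginning on every call: O(step) work per call,
    so retrieving all n intermediate values costs O(n^2) overall."""
    total = 0
    for i in range(step + 1):
        total += i
    return total

n = 5
resumed = list(coroutine(n))                      # one pass, O(n)
restarted = [subroutine(n, s) for s in range(n)]  # n restarts, O(n^2)
assert resumed == restarted
```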

Therefore, if in a reduction of the procedural solutions of one problem to the procedural solutions of another problem we used only executions of solutions to the second problem, we can simulate it by a polynomial-time Turing reduction. However, the (bizarre) possibility of using the program for the second problem in ways other than executing it still leaves the question of [Ev] "formally" open.

Remark. It might be interesting to define an "algebra of procedures" as follows. Its elements (the procedures) will be pairs consisting of a program and its input and program variables (but with no output variables). Lucid-box compositions will serve as the operations of this algebra, since they allow producing a procedure out of existing ones. This will be analogous to an algebra of algorithms that can be defined using black-box compositions. We do not elaborate on this idea here and leave it for future research.

4. On Structured Programming.

A growing number of programs, documentation of programs and

presentations of algorithms in the literature use principles of

structured programming for modularity and clarity of exposition.

A methodologically useful way to present algorithms in this spirit

is to start with a high-level description of the algorithm which

includes the milestones in its performance. In a hierarchical fashion

this overview is filled with details, until a complete low-level

accurate specification is obtained. For instance, take the

biconnectivity algorithm using depth-first search (DFS) in [AHU].

After presenting the DFS method for searching a graph (to which we

refer as a high-level description) the book presents the DFS again, now intermixed with the bookkeeping required for the biconnectivity algorithm. An alternative approach that we would like to point out

looks at the DFS as a navigator that tells us where to go next, while

there is a bookkeeper that should be told where we are, in order to be

able to do the necessary bookkeeping. We suggest the following

description. There is a 'general manager' that coordinates between the

navigator ('president of the company') and the bookkeeper. The general

manager asks the president where to go, and then transmits the response

to the bookkeeper. The bookkeeper writes down whatever is required and

reports to the general manager when he is done. Then the general

manager consults the president again, and so on.
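A minimal sketch of this scheme (ours; all names are illustrative): the navigator is an independent DFS coroutine that only announces where we go next, the bookkeeper keeps its own records, and the general manager shuttles messages between them.

```python
def navigator(graph, root):
    """High-level DFS alone: yields ('visit', v) and ('retreat', v) events."""
    seen = {root}
    def dfs(v):
        yield ('visit', v)
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                yield from dfs(w)
        yield ('retreat', v)
    yield from dfs(root)

class Bookkeeper:
    """Low-level bookkeeping (here: discovery numbers), kept separate."""
    def __init__(self):
        self.discovery = {}
    def record(self, event):
        kind, v = event
        if kind == 'visit':
            self.discovery[v] = len(self.discovery)

def general_manager(graph, root):
    bk = Bookkeeper()
    for event in navigator(graph, root):  # ask the 'president' where to go
        bk.record(event)                  # tell the bookkeeper where we are
    return bk.discovery

graph = {0: [1, 2], 1: [2], 2: []}
print(general_manager(graph, 0))  # prints {0: 0, 1: 1, 2: 2}
```

Because the navigator never mentions discovery numbers, the same DFS coroutine can be reused unchanged with a different bookkeeper, which is exactly the modularity claimed above.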

The modularity of the presentation is increased since now the

high-level description is an independent piece of software. Besides

the important advantage of clarity we would like to point out the

following possible advantage: In many cases a first version of an

algorithm is improved later by polishing its low-level implementation

(bookkeeping) only. It is desirable in such cases to utilize the

previous fault-free high-level specification. If it is written as an independent unit, it can be used as is.

[K] mentions the methodological importance of understanding

several coroutines as (symmetric) equal partners, unlike a master-slave

relation between a program and a subroutine it calls. The above

hierarchical description may well fit this understanding by naming a

new general manager who employs all the coroutines involved, so that no coroutine dominates another.

Examples

Example I. Choice of a model of parallel computation.

The paper [V2] deals with the problem of choosing a theoretical

abstract model of parallel computation to be simulated by synchronous

distributed machines. It applies the principle of choosing the most permissive model of parallel computation as long as the cost of computational resources does not increase.

We sketch the main ideas in this paper emphasizing the ones that

relate to lucid-box compositions.

Our machine is assumed to be represented by a model of synchronous

distributed computations (SDC). It employs a sequence of RAMs (see [AHU]) P_1, P_2, ..., P_s that operate synchronously in parallel. Each

processor can communicate directly with no more than c other processors

(where c is some "small" constant) through communication lines.

Communication registers which are associated with the lines are used

for the communication.

The concurrent-read exclusive-write parallel RAM (CREW PRAM) is a

synchronous model of parallel computation in which all p processors P_1, ..., P_p have access to a shared memory. Simultaneous access to the

same common memory location is allowed for read (but not write)

purposes. The Fetch-and-Add (F&A) PRAM model of synchronous parallel

computation allows every operation which is permitted by the CREW PRAM.

In addition, the following is allowed. Let A be a common memory address and let e_i be some address in the local memory of processor P_i. We define the F&A instruction as follows. If processor P_i performs F&A(A, e_i) and no other processor performs at the same time an instruction that relates to address A, then the content of A is transmitted to processor P_i and stored in one of its local registers, and address A is assigned the value A + e_i. Suppose that several processors simultaneously perform F&A instructions that relate to A. The result is defined to be as if these instructions were performed serially in some order.
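The serial-order semantics can be sketched as follows (our illustration; the serialization order is chosen here, arbitrarily, as list order):

```python
def fetch_and_add_pulse(memory, requests):
    """requests: list of (processor, address, increment) issued in one pulse.
    Serializes simultaneous F&A's in list order; returns what each
    processor fetches into its local register."""
    fetched = {}
    for proc, addr, inc in requests:
        fetched[proc] = memory[addr]  # old contents go to the processor
        memory[addr] += inc           # address A is assigned A + e
    return fetched

memory = {'A': 10}
fetched = fetch_and_add_pulse(memory, [(1, 'A', 2), (2, 'A', 3)])
assert fetched == {1: 10, 2: 12} and memory['A'] == 15
```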

Generally speaking, the main result of the paper is that every

simulation of algorithms given in the CREW PRAM into the SDC has a

counterpart simulation for algorithms given in the F&A PRAM into the

SDC which requires the same SDC-time and SDC-local memories up to a

constant factor. This supports the choice of the F&A PRAM as an

abstract model of parallel computation as was done for the

NYU-Ultracomputer [GGKMRS].

We sketch the main ideas that relate to lucid-box compositions.

Let us take a pulse of the F&A PRAM. Assume that in this pulse processors i_1, i_2, ..., i_k want to execute F&A(A_1, e_1), F&A(A_2, e_2), ..., F&A(A_k, e_k), respectively, where A_j is a common memory address and e_j is some address in the local memory of processor i_j, for 1 ≤ j ≤ k. All other processors of the F&A PRAM are assumed to remain idle during the present pulse. Finally, we have arrived at the point where the present example relates to lucid-box compositions. We 'replace' the given pulse by the following: every F&A(A_j, e_j) instruction for processor i_j is replaced by the instruction "read common address A_j into local address e_j of processor i_j", for 1 ≤ j ≤ k. We apply the simulation of

the CREW PRAM into the SDC in order to simulate this 'reading' pulse.

This simulation is done in auxiliary memory locations of the SDC and

all cases where the contents of a (memory or communication) location of

the SDC is copied into another are recorded in the local memory of the

processor that does it. Let Ā_j (resp. ē_j) be a memory location of the SDC which simulates memory location A_j (resp. e_j). We assume that there is exactly one such Ā_j (resp. ē_j). The correctness of the simulation implies that the contents of location A_j is propagated from Ā_j to ē_j between the time the simulation of the present reading pulse begins and the time it ends. Say that it takes T cycles of the SDC machine. We assume that this propagation occurs by repeatedly copying the contents of Ā_j from one SDC memory location to another until it is finally copied into ē_j. (No splitting of the bits of this contents or

encoding of contents of several locations into one are allowed.) This is called the copying assumption.

Let A be the set of all memory locations (in all local memories)

of the SDC. The following (layered) directed graph G(V,E) is

introduced as a tool for specifying the synchronization of the

remaining steps; edges between two successive layers represent

simultaneous events that take place between two successive ticks of our

'clock'. The set of vertices V of G is V = {(a,t) | a ∈ A, 0 ≤ t ≤ T}. Layer t of G, L_t, is L_t = {(a,t) | a ∈ A}, for 0 ≤ t ≤ T. Let

E_1 = {((a,t-1), (b,t)) | an (SDC) processor P_i, for some 1 ≤ i ≤ s, copied the contents of address a into address b at time unit t, 1 ≤ t ≤ T}

E_2 = {((a,t-1), (a,t)) | no processor wrote into address a at time unit t, 1 ≤ t ≤ T}

The set of edges E is the union of E_1 and E_2. Note that all edges of E_1 were actually recorded in the local memories of the corresponding processors.

It is not difficult to observe that binary trees with the Ā_j as roots and the ē_j as leaves are subgraphs of G. The simulation of the F&A instructions proceeds by identifying these trees and implementing a synchronous partial-sums computation on them in order to satisfy the F&A instructions. The following facts follow readily: the time required for the simulation is O(T); and, if processors P_1, P_2, ..., P_s employed local memories of sizes m_1, m_2, ..., m_s, respectively, for the simulation of the reading pulse, then they employ local memories of sizes O(m_1+T), O(m_2+T), ..., O(m_s+T), respectively, for the simulation of the pulse that involved the F&A instructions.
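A sequential sketch (ours) of the effect the synchronous partial-sums computation on these trees must achieve: each F&A request on an address receives the old contents plus the sum of the increments serialized before it, and the address ends up incremented by the total.

```python
def batch_fanda(memory, requests):
    """requests: list of (address, increment), one per requesting processor.
    Returns the value each processor receives, computed via running
    partial sums per address, matching the serial-order F&A semantics."""
    results = []
    partial = {}                  # running sum of increments per address
    for addr, inc in requests:
        before = partial.get(addr, 0)
        results.append(memory[addr] + before)  # old value + earlier increments
        partial[addr] = before + inc
    for addr, total in partial.items():
        memory[addr] += total     # one combined update per address
    return results

memory = {'A': 100, 'B': 7}
out = batch_fanda(memory, [('A', 1), ('B', 5), ('A', 10)])
assert out == [100, 7, 101]
assert memory == {'A': 111, 'B': 12}
```

The tree computation in the paper obtains the same per-request partial sums synchronously in O(T) time; this sketch only states what the results must be, not how they are computed in parallel.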

The refined correspondence between the many performance parameters of an existing procedure and the corresponding parameters of a new procedure seems to be relevant for many computational environments: For instance,

an algorithm for a synchronous parallel shared memory model of

computation may be evaluated by its number of steps, size of each of

its local and common memories, number of accesses to the shared memory

and their frequency and many more parameters. Another example is an

algorithm for synchronous distributed machines, which may be evaluated by the communication load on each line, the sizes of local memories and the number

of steps.

The composition of the simulation of the pulse that involves F&A

instructions using execution of the simulation of the reading pulse

gave us a many-parameter correspondence between the two simulations. The notion of lucid-box compositions allows for a closer adherence to the existing procedure and enables us to use information which one cannot expect to find in typical predeclared outputs. Therefore, we claim that lucid-box compositions are tailored for multi-parameter optimization of algorithms.

Example II. A Parallel-Design Distributed-Implementation (PDDI)

General-Purpose Computer.

The paper [V1] introduces a scheme of an efficient general-purpose parallel computer. Its design space (i.e., the model for which parallel programs are written) is the F&A PRAM. The implementation space is presented as a scheme of an SDC in which each processor may communicate with at most 4 others. We sketch below how lucid-box

compositions may help in the construction of an efficient translation

of the design space into the implementation space.

The F&A PRAM employs p processors 1, 2, ..., p and m common memory addresses 1, 2, ..., m.

The SDC consists of a sorting network followed by a merging

network as in Figure 1. Any sorting and merging networks can be used. The SDC employs d (≤ p) strong processors (called super-processors); each of them is responsible for simulating the behavior of about p/d F&A PRAM processors. It also employs 'comparator processors' that behave similarly to comparator modules of comparison networks, and m 'memory processors', each of which is responsible for simulating the behavior of one common memory location. We present the solution only for a pulse of the F&A PRAM of the following form. Assume that processors i_1, i_2, ..., i_k want to execute F&A(A_1, e_1), F&A(A_2, e_2), ..., F&A(A_k, e_k), respectively, where A_j is a common memory address and e_j is an address in the local memory of processor i_j, 1 ≤ j ≤ k. All other processors of the F&A PRAM remain idle during the present pulse.
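Anticipating the cycle-by-cycle translation described next, here is a sketch (ours) of how d super-processors can take over the p F&A PRAM processors in rounds, super-processor i handling processor i + (j-1)d at cycle j:

```python
from math import ceil

def schedule(p, d):
    """Returns, per cycle, the (super_processor, simulated_processor) pairs
    when d super-processors simulate p PRAM processors in ceil(p/d) cycles."""
    cycles = []
    for j in range(1, ceil(p / d) + 1):
        pairs = [(i, i + (j - 1) * d)
                 for i in range(1, d + 1)
                 if i + (j - 1) * d <= p]   # last cycle may be partial
        cycles.append(pairs)
    return cycles

# p = 5 processors, d = 2 super-processors -> 3 cycles
assert schedule(5, 2) == [[(1, 1), (2, 2)], [(1, 3), (2, 4)], [(1, 5)]]
```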


The translation into the SDC proceeds in cycles. At the j-th cycle, super-processor i simulates the behavior of processor i+(j-1)d, 1 ≤ i ≤ d, 1
