B. D. Lubachevsky. Parallelizing an Algorithm of Charles S. Peskin for Describing Incompressible Flow of a Fluid Coupled to an Elastic Membrane. (page 2 of 3)

minimal value C*(1) and the maximal value C*_max of C*(P), P = 1,...,32,
are also shown. In table 5.1 the values of T, K, and C* are given in units
of thousands of simulated CDC-6600 cycles (for example, according to
table 5.1 the maximal execution complexity of a problem with

* According to C. Peskin, at least two membrane points must be placed in
each mesh Δx × Δx square whose opposite sides are "pierced" by the
membrane. Since N2 and N1 are powers of 2, the coefficient r in the
equality N2 = r × N1 must be a power of 2. Considering a circular
membrane of diameter 1/2 as a model, we can easily see that r = 4 is the
minimal value of r satisfying these limitations. On the other hand, for
a given N1 an excessively large r (and hence N2) leads to excessive
computations on the membrane without an increase of accuracy. Thus r = 4
is optimal for the circle. In the general case, the optimal choice of
r depends on the membrane geometry.


(N1, N2) = (4,16) was 3109 thousand cycles). The "error %" gives the
maximum over the 32 runs of the relative prediction error
|C(P)-C*(P)|/C*(P); the "error abs" is similarly the maximum of the
absolute prediction errors |C(P)-C*(P)|. (Note that "error %" may be
greater than "error abs"/C*_max, since the maximums need not occur for the
same case.)

PROBLEM SIZE   |  T1  |  T2  |  T3 |  K  | K/C*(1) | error % | error abs | C*(1) | C*_max
N1, N2 (#iter) |      |      |     |     |         |         |           |       |
---------------+------+------+-----+-----+---------+---------+-----------+-------+-------
 4, 16 (6)     |  7.6 | 49.6 | 1.6 | 9.2 |  1.1%   |  .18%   |    2.6    |  838. | 2176.
 8, 32 (5)     | 17.3 | 48.0 | 3.4 | 8.5 |  0.5%   |  .19%   |    4.3    | 1701. | 3882.
16, 64 (5)     | 39.3 | 53.1 | 8.3 | 9.4 |  0.2%   |  .16%   |    6.9    | 4111. | 6700.
===============+======+======+=====+=====+=========+=========+===========+=======+=======
 4,  8 (5)     |  7.4 | 38.0 | 1.3 | 7.5 |  2.2%   |  .12%   |     .5    |  345. | 1734.
 8, 16 (5)     | 17.4 | 42.9 | 3.5 | 8.1 |  1.0%   |  .23%   |    2.1    |  850. | 2296.
16, 32 (5)     | 39.3 | 47.9 | 8.5 | 8.5 |  0.4%   |  .20%   |    4.7    | 2246. | 4712.
===============+======+======+=====+=====+=========+=========+===========+=======+=======
 8,  8 (5)     |    55.0     | 3.3 | 7.5 |  1.6%   |  .12%   |     .6    |  464. | 2106.
16, 16 (5)     |    82.0     | 8.3 | 8.2 |  0.6%   |  .18%   |    2.7    | 1395. | 3152.

Table 5.1

(In the third section T1 and T2 could not be identified separately; the single entry is the estimate of T1 + T2.)

The following conclusions were derived by examining the main series of
experiments:

a) The model (5.3) allows the prediction of the parallel complexity C(P)
very precisely once the serial complexities, the T's and K, are known.

b) Generally all the T's and K increase when the problem size increases


(."^2 fo'^ the (8,32) problem is slightly less than that for

(4,16) problem due to the smaller //iter).

c) Korn's coefficient K may increases with the problem size, however

the ratio K/C^(l) steadily decreases.

5.3. Additional series of runs. As we will see in section 6, conclusion
c) is especially important for obtaining high efficiency in
large-scale parallel computations. To increase our confidence in
statement c), two additional series of runs, with N2 = 2×N1 and N2 = N1,
respectively, were performed. By virtue of a), these additional series
were run for a smaller number of Ultracomputer sizes: only 11 values of P
were picked, P = 1, 2, 3, 4, 6, 7, 8, 16, 24, 31, 32. Note that in the
series with N2 = N1 the linear equations whose coefficients are being
identified by the least-squares approximation do not allow the
determination of T1 and T2 separately, since the corresponding variables
have the same coefficient Ξ_P(N1) = Ξ_P(N2) in each of the equations.
For this series the estimates of T1 + T2 were computed.

The results of these additional series are also presented in
table 5.1 (in the second and the third sections, which are separated by
double-dashed lines). These series confirm statement c) above.

5.4. Serial complexities. We next analysed how #iter, Korn's
coefficient, and the serial complexities depend on the problem
parameters.


The value of #iter is a function of all the parameters of the
problem, not only of the problem size. Since this dependence is not known,
we would like a high efficiency for any value of #iter. In fact this
is the case; see section 6. The Newton minimization algorithm was
set for at least two iterations, so #iter may vary from 2 to +∞.

Trying to build a complexity model for Korn's coefficient, we
analyse all the sources of complexity that depend on P, the number of PEs.
The largest portion of this overhead, due to a mismatch between P and
the multiplicity of task spawning, need not be considered here, since it
has already been included in the terms Ξ_P(M_i) × T_i of (5.2). The other
sources of the overhead are:

The REQUEST routine in (4.1) must be executed by each PE after all
the currently spawned tasks have been assigned. The number of instances of
task spawnings (4.11) is constant for the part of the program dealing
with the mesh. Since the task spawnings are embedded in the Newton
minimization loop, which is executed #iter times, the number of
instances of task spawnings (4.11) is proportional to

(5.4)    A + (B + C × lgN2) × #iter

for the part dealing with the membrane. Therefore the overall overhead per
PE due to these factors can be expressed as (5.4).

Line 10 in the SYNCH code* has to be executed by all but one PE

* The FORTRAN FTN compiler of the CDC-6600 considers line 10 a mistake
(which it surely would be if the code were serial). A "parallel"
FORTRAN compiler should allow line 10. In the existing code "line 10"
is in fact two lines, using a no-op CONTINUE statement.


subsequent to #FREE_PES(INDEX) becoming 0; the overhead per PE can be
expressed in the form (5.4).

To make our prediction of K safer we include a term proportional
to N2 × #iter in the expression for K. (Obtaining a small coefficient
for this term would indicate that we did not miss components of K
growing much faster than lgN2 × #iter.) Thus we arrive at the following
prediction formula:

(5.5)    K = k_0 + (k_1 + k_2 × lgN2 + k_3 × N2) × #iter

The four coefficients k_i in (5.5) were identified from the 8
experimental values of K in table 5.1 using the least-squares method as
above. The results are presented in table 5.2, where the error columns
have the same meaning as in table 5.1.

 k_0  | k_1 | k_2 | k_3  | error % | error abs
------+-----+-----+------+---------+----------
 2.17 | .80 | .09 | .002 |  2.3%   |   .19

Table 5.2

The experimental k_3 is small enough. (Since one machine cycle
corresponds to the addition of .001 to a coefficient, this k_3
corresponds to only two cycles.) This confirms our analysis of the
complexity of K.
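The identification of (5.5) can be sketched as a small least-squares fit. The eight (N2, #iter, K) triples below are read from table 5.1; numpy's `lstsq` stands in for whatever solver was used originally (an assumption, the paper does not name its routine).

```python
import math

import numpy as np

# (N2, #iter, K) triples read from table 5.1 (K in thousands of cycles).
runs = [(16, 6, 9.2), (32, 5, 8.5), (64, 5, 9.4),
        (8, 5, 7.5), (16, 5, 8.1), (32, 5, 8.5),
        (8, 5, 7.5), (16, 5, 8.2)]

# Design matrix for K = k0 + (k1 + k2*lg(N2) + k3*N2) * #iter,
# i.e. columns: 1, #iter, lg(N2)*#iter, N2*#iter.
A = np.array([[1.0, it, math.log2(n2) * it, n2 * it] for n2, it, _ in runs])
y = np.array([k for _, _, k in runs])

k, *_ = np.linalg.lstsq(A, y, rcond=None)  # k = (k0, k1, k2, k3)
residuals = A @ k - y                      # per-run prediction errors
```

With the paper's rounded coefficients (2.17, .80, .09, .002) the worst disagreement with the table is about 0.2 thousand cycles; the fit above can only do better in the least-squares sense.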


The analysis of the dependence of T1, T2, and T3 on the problem
parameters could be done by actually counting instructions in the
compiled code. This method, however, was rejected because of the large
size of the program (the compiled serial code occupies more than 6000
words, each containing up to 4 instructions). We tried instead to apply the
linear identification method to express the T's as functions of the problem
size. By analyzing table 3.1 one sees that the following expressions
might be candidates for the identification of their coefficients:

(5.6.1)    T1 = a_01 + a_11 × N1 + a_21 × N1 × lgN1
(5.6.2)    T2 = a_02 + a_12 × #iter + a_22 × #iter × lgN2
(5.6.3)    T3 = a_03 + a_13 × N1 + a_23 × N1 × lgN1

Formulas (5.6) with their left-hand sides replaced by the experimental
values obtained from table 5.1 allow us to generate a number of linear
equations with the a_ij unknown. (To use the experimental values of T1 + T2 in
the series N1 = N2 one has to add (5.6.1) to (5.6.2).) 14 equations
were solved to identify the 6 coefficients a_ij in (5.6.1) and (5.6.2), and
(separately) 8 equations were solved to identify the 3 coefficients a_i3 in
(5.6.3). The results of these identifications are presented in
table 5.3.

 i | a_0i | a_1i  | a_2i | error % | error abs
---+------+-------+------+---------+----------
 1 |  .34 | 1.29  |  .29 |  .94%   |   .16
 2 | 9.30 | 2.65  | 1.02 |         |
 3 |  .61 | -.060 |  .14 |  11.%   |   .17

Table 5.3

(The coefficients for i = 1 and i = 2 were identified jointly, hence the single error entry for the pair.)


This table indicates good accuracy for the coefficients of (5.6.1) and
(5.6.2) and low accuracy for those of (5.6.3), which means that models
(5.6.1) and (5.6.2) are satisfactory but model (5.6.3) is rather coarse.
(Coefficient a_13 even turns out negative!) This does not contradict
table 3.1, which gives only the order of complexity for each task. Since
table 5.1 shows that T3 is small, the term involving T3 in (5.3) is also
small, and we have not refined this model.
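As a quick check on the identification, model (5.6.1) with the table 5.3 coefficients can be evaluated against the observed T1 values from table 5.1. This is a sketch; the half-thousand-cycle tolerance is our own choice, not the paper's.

```python
import math

# Coefficients a_01, a_11, a_21 of model (5.6.1), from table 5.3.
a0, a1, a2 = 0.34, 1.29, 0.29

def t1_model(n1: int) -> float:
    """Predicted T1 (thousands of cycles): a_01 + a_11*N1 + a_21*N1*lg(N1)."""
    return a0 + a1 * n1 + a2 * n1 * math.log2(n1)

# Observed T1 values from table 5.1, keyed by N1 (first and second series).
observed = {4: [7.6, 7.4], 8: [17.3, 17.4], 16: [39.3, 39.3]}

worst = max(abs(t1_model(n) - t) for n, ts in observed.items() for t in ts)
```

Every prediction lands within half a thousand cycles of the measured T1, which is what "satisfactory" means for model (5.6.1) here.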


6. The estimation of efficiency.

The aim of this section is to study how the efficiency of the
parallel algorithm depends on the number P of PEs and on the problem
size. The target is to predict the efficiency for realistic-size two-
and three-dimensional problems that are too large for us to simulate.

6.1. First let us recall the commonly accepted definitions of
efficiency. The ideal efficiency E_P^ideal of an algorithm executed by P
PEs is given by

(6.1)    E_P^ideal = T_s^best / (P × T_P(P))

where T_s^best is the time required by the best possible serial algorithm,
and T_P(P) is the time spent by the parallel algorithm with P PEs.

Since the best serial algorithm is usually not known, T_s^best in
(6.1) is replaced by T_s^good, the time of the serial algorithm
accepted for the problem:

(6.2)    E_P^real = T_s^good / (P × T_P(P))

Note that the initial serial algorithm was accepted as "good" in our
problem. After omitting the superscripts "real" and "good", (6.2) may
be rewritten as

(6.3)    E_P = (T_s / T_P(1)) × (T_P(1) / (P × T_P(P)))

The ratio e_0 = T_s / T_P(1) measures the loss due to modifying the
serial code to make it parallel. The ratio e_P = T_P(1) / (P × T_P(P))
measures the loss due to the execution of the parallel code by P processor
elements. Our aim now is to evaluate e_0 and e_P for various values of P.
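The decomposition (6.3) is an identity: E_P factors exactly into e_0 and e_P. A minimal sketch with made-up timings (the numbers below are illustrative, not measurements from the paper):

```python
def efficiency(ts: float, tp1: float, tpp: float, p: int):
    """Return (E_P, e0, eP) per (6.2)-(6.3).

    ts  - time of the accepted "good" serial algorithm,
    tp1 - parallel code run on 1 PE,
    tpp - parallel code run on P PEs.
    """
    E = ts / (p * tpp)   # (6.2), with T_good standing in for T_best
    e0 = ts / tp1        # loss from converting serial code to parallel
    eP = tp1 / (p * tpp) # loss from executing the parallel code on P PEs
    return E, e0, eP

# Hypothetical timings: serial 100, parallel-on-1 108, parallel-on-8 15.
E, e0, eP = efficiency(100.0, 108.0, 15.0, 8)
```

Here e0 ≈ .926, in the same range as the e_0 values of table 6.1, and E = e0 × eP exactly, since T_P(1) cancels.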

6.2. Evaluation of e_0. Table 6.1 contains empirically obtained
values of e_0 for various problem sizes. The values of T_s used for
calculating e_0 were obtained by running the serial code in a special
tracing mode provided by the simulator [UCN12].

N1, N2 (#iter) |  e_0
---------------+------
 4, 16 (6)     | .925
 8, 32 (5)     | .934
16, 64 (5)     | .943
---------------+------
 4,  8 (5)     | .915
 8, 16 (5)     | .933
16, 32 (5)     | .948
---------------+------
 8,  8 (5)     | .933
16, 16 (5)     | .956

Table 6.1

Table 6.1 shows that e_0 steadily increases toward one when both
parameters N1 and N2 characterizing the problem size increase, keeping
N2/N1 fixed. Since the values are not far from one, we conclude that
converting the serial code into parallel code causes low overhead.


One can also observe the following irregularity in table 6.1: for
N1 = 4 and 8 the value of e_0 increases when N2 increases, whereas for
N1 = 16, e_0 decreases when N2 increases.

This effect could be explained by expressing e_0 as the weighted
sum e_0 = λ1 × e_0^1 + λ2 × e_0^2, where λ1 and λ2 are the weights of the
execution complexities of the subcodes corresponding to the mesh and
membrane, respectively, λ1 + λ2 = 1, and e_0^1 and e_0^2 are the
efficiencies of the respective subcodes. These partial e_0's and λ's
depend on the problem size. Both partial e_0's generally increase with the
size of the problem. However, the speed of this increase may be different
for different sizes.

There would be no contradiction with table 6.1 if, for the
interval of N2 given in that table, the inequality e_0^2 > e_0^1 held for
smaller N1 and the opposite inequality held for larger N1. Then an
increase in N2 with N1 fixed would cause an increase in λ2, and this
would increase the total e_0 for small N1 and decrease the total e_0
for large N1.

6.3. Efficiencies for small 2-D problems. Table 6.2 contains
empirical and predicted efficiencies e_P, P ≥ 2, for the three smallest
regular (i.e. N2 = 4×N1) sizes of the 2-D problem. The prediction
was based on the formula

(6.4)    e_P = C(1) / (P × C(P))

where C(P) and C(1) are obtained from (5.3) with coefficients from
table 5.1. The predicted values are given in parentheses. The
agreement is seen to be good.

     |          PROBLEM SIZE N1, N2 (#iter)
  P  |   4, 16 (6)    |   8, 32 (5)    |  16, 64 (5)
-----+----------------+----------------+---------------
  2  | .9867 (.9872)  | .9927 (.9930)  | .9955 (.9957)
  3  | .8629 (.8632)  | .9520 (.9520)  | .9527 (.9528)
  4  | .9644 (.9663)  | .9789 (.9793)  | .9870 (.9872)
  7  | .7156 (.7159)  | .8475 (.8476)  | .8713 (.8702)
  8  | .8888 (.8906)  | .9585 (.9605)  | .9707 (.9707)
 15  | .4756 (.4750)  | .6541 (.6542)  | .7579 (.7576)
 16  | .7691 (.7700)  | .8466 (.8484)  | .9523 (.9538)
 31  | .3974 (.3974)  | .4383 (.4379)  | .6135 (.6131)
 32  | .3850 (.3850)  | .6866 (.6878)  | .7864 (.7871)
 48  |    (.2566)     |    (.4584)     |    (.5248)
 64  |    (.1925)     |    (.3438)     |    (.5833)

Table 6.2

6.4. Estimating efficiency for large-size problems cannot be done
by using formulas (5.3) and (6.4) directly, since #iter is not known.
If one substitutes (5.5) and (5.6) into (5.3), the efficiency appears as a
rational function E(#iter) of the form

E(#iter) = (A × #iter + B) / (C × #iter + D)

where the coefficients A, B, C, D depend on the problem-size parameters
and on P. Since #iter may vary from 2 to +∞ and A, B, C, D > 0, the
(asymptotic) minimum of E(#iter) is E(+∞) = A/C if A×D - B×C < 0, and
is E(2) = (2×A + B)/(2×C + D) otherwise. This minimum is what we will
call E_P^o, the optimistic estimate of e_P.
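The minimization over #iter described above is a one-liner: a rational function (A·x + B)/(C·x + D) with positive coefficients is monotone on [2, +∞), the sign of A·D - B·C deciding the direction, so the infimum sits at one of the two endpoints. A sketch:

```python
def min_efficiency(a: float, b: float, c: float, d: float) -> float:
    """Infimum of E(x) = (a*x + b)/(c*x + d) over x in [2, +inf); a,b,c,d > 0.

    The derivative of E has the sign of a*d - b*c: if it is negative, E
    decreases toward the asymptote a/c; otherwise the minimum is at x = 2.
    """
    if a * d - b * c < 0:
        return a / c                      # E(+inf)
    return (2 * a + b) / (2 * c + d)      # E(2)
```

For example, with (A, B, C, D) = (1, 10, 1, 1) the function decreases toward A/C = 1, while with (1, 1, 1, 10) the minimum is E(2) = 3/12.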

Since the prediction of the serial complexity T3 is rather crude, we
now develop a more conservative estimate E_P^p, which we call the
pessimistic estimate. Note that the parallel complexity model (5.3) is
considered accurate in both policies.

Replacing C(P) and C(1) in (6.4) by their expressions from (5.2)
and carrying out the obvious algebraic transformations, one can obtain
the following expression for e_P:

(6.5)    e_P = Σ_{i=1}^{L} λ_i × e_P^i + λ_0 × e_P^0,

where the weights λ_i, i = 1,...,L, are given by

(6.6)    λ_i = (Ξ_P(M_i) × T_i) / (Ξ_P(M_1) × T_1 + ... + Ξ_P(M_L) × T_L + P × K)

and the partial efficiencies e_P^i are given by

(6.7)    e_P^i = M_i / (P × Ξ_P(M_i)).

Formula (6.6) is valid for i = 0 if one takes K as T_0 and P as Ξ_P(M_0).

After discarding the term λ_0 × e_P^0 in (6.5) one has:

(6.8)    e_P ≥ Σ_{i=1}^{L} λ_i × e_P^i = (Σ_{i=1}^{L} λ_i) × (Σ_{i=1}^{L} μ_i × e_P^i)

where the new weights μ_i = λ_i / (Σ_{j=1}^{L} λ_j) are given by

(6.9)    μ_i = (Ξ_P(M_i) × T_i) / (Ξ_P(M_1) × T_1 + ... + Ξ_P(M_L) × T_L)

Note that

(6.10)    Σ_{i=1}^{L} λ_i = (C(P) - P × K) / C(P) ≥ 1 - P × K/C(1).

Substituting the bound on Σ_{i=1}^{L} λ_i from (6.10) into (6.8), one gets
the following lower bound on e_P for large-size problems:

(6.11)    e_P ≥ (1 - P × K/C(1)) × (Σ_{i=1}^{L} μ_i × e_P^i)

where the μ_i's are given by (6.9) and the e_P^i's are given by (6.7).

In the pessimistic policy we estimate the ratio K/C(1) by the
minimal of its observed values in the small-size experiments (see
section 5), i.e. by 0.2%, and the μ_i-weighted sum in (6.11) is
replaced by the minimum of e_P^i over all i = 1,...,L. Hence

(6.12)    E_P^p = (1 - P × 0.002) × min(e_P^1, e_P^2, e_P^3)

where e_P^1, e_P^2, e_P^3 are the partial efficiencies corresponding to T1, T2, and T3,
respectively.
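A sketch of the pessimistic estimate (6.12). The multiplicity function is assumed here to be Ξ_P(M) = ⌈M/P⌉, the number of passes needed to run M tasks on P PEs, which makes the partial efficiency (6.7) equal M/(P·⌈M/P⌉); that reading of Ξ_P is our assumption, consistent with (6.7) but not spelled out in this excerpt.

```python
import math

def xi(m: int, p: int) -> int:
    """Assumed multiplicity function: passes to run m tasks on p PEs."""
    return math.ceil(m / p)

def partial_eff(m: int, p: int) -> float:
    """(6.7): e_P^i = M_i / (P * Xi_P(M_i))."""
    return m / (p * xi(m, p))

def pessimistic(p: int, multiplicities) -> float:
    """(6.12): E_P^p = (1 - P*0.002) * min_i e_P^i, 0.002 being K/C(1)."""
    return (1 - p * 0.002) * min(partial_eff(m, p) for m in multiplicities)
```

For P = 32 and multiplicities that are multiples of 32, every partial efficiency is 1 and only the overhead factor 1 - 32 × 0.002 remains; a mismatched multiplicity such as 33 costs a whole extra pass and drops its partial efficiency to 33/64.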


Entries in table 6.3 below are E(5)/E_P^o/E_P^p, where E(5) is the
efficiency predicted for #iter = 5. The symbol "—" replaces a
negative E_P^p.

     |              PROBLEM SIZE N1, N2
  P  |    32, 128     |    64, 256     |   128, 512     |   256, 1024
-----+----------------+----------------+----------------+----------------
   2 | .997/.996/.941 | .998/.997/.967 | .999/.998/.981 | .999/.999/.988
   4 | .991/.989/.843 | .994/.993/.909 | .996/.995/.948 | .998/.997/.969
   8 | .98/.97/.70    | .99/.98/.81    | .99/.99/.89    | .99/.99/.93
  16 | .96/.95/.51    | .97/.96/.67    | .98/.98/.77    | .99/.99/.87
  32 | .94/.94/.50    | .94/.93/.48    | .96/.95/.63    | .98/.97/.75
  64 | .71/.62/.23    | .93/.92/.45    | .92/.91/.44    | .95/.95/.59
 128 | .47/.38/.10    | .62/.55/.19    | .91/.91/.38    | .91/.90/.37
 256 | .23/.19/.03    | .37/.31/.06    | .55/.50/.12    | .90/.90/.25
 512 | .12/.09/—      | .19/.15/—      | .30/.27/—      | .50/.47/—
1024 | .06/.05/—      | .09/.08/—      | .15/.13/—      | .26/.24/—

Table 6.3

The data in table 6.3 show that parallelization is efficient in the
2-dimensional case for up to a few hundred PEs.

6.5. Efficiency prediction for 3-D problems. In the 3-D case the
membrane is 2-dimensional and many different structures are possible.
We suppose, however, that the membrane is reinforced by a collection of
one-dimensional fibers and that the minimization problem (2.3)-(2.5) is
to be solved independently on each fiber in this collection. Such a
collection can be treated as a single large fiber. (This may not be
the most efficient method, but it is the one we can analyse without
further programming.)

Let the total number of membrane points be N2 = 8 × N1². This
relation may be considered optimal in the considered case (cf. the
footnote on p. 22). We will assign the same amount of work to each
task, but replace the multiplicities N1 by N1² for the tasks dealing with
the mesh. (The multiplicity N1/2 + 1 of the non-standard FFT task must be
replaced by N1 × (N1/2 + 1).) The orders of complexity of the tasks remain
the same as in table 3.1. Whereas the numbers of invocations required, and
thus the values of the coefficients a_ij in (5.7), may change, we will not
try to predict the a_ij and k_i in the 3-D case. Instead we use the 2-D
values. Table 6.4 presents the E_P^o predictions obtained.

Note that E_P^o increases when the a_ij increase. (This may be proven
analytically; since the formula for computing E_P^o was implemented as a
computer program, we preferred to check it experimentally. For the
problem size N1 = 256, N2 = 524288 the values of the a_ij were varied
independently and uniformly in the range from the nominal value to 150%
of it. All 1000 computed values of the trial E_P^o appeared to be greater
than the nominal value of this quantity.)


     |            PROBLEM SIZE N1, N2
  P  | 32, 8192 | 64, 32768 | 128, 131072 | 256, 524288
-----+----------+-----------+-------------+------------
   2 |  .9999   |  .9999    |   .9999     |   .9999
   4 |  .9996   |  .9997    |   .9997     |   .9997
   8 |  .9991   |  .9993    |   .9993     |   .9994
  16 |  .998    |  .998     |   .999      |   .999
  32 |  .996    |  .997     |   .997      |   .997
  64 |  .992    |  .993     |   .994      |   .995
 128 |  .984    |  .987     |   .988      |   .989
 256 |  .968    |  .974     |   .977      |   .979
 512 |  .937    |  .949     |   .955      |   .959
1024 |  .882    |  .902     |   .913      |   .921
2048 |  .692    |  .821     |   .840      |   .854
4096 |  .471    |  .697     |   .724      |   .744
8192 |  .288    |  .527     |   .568      |   .593

Table 6.4

The data in table 6.4 show that parallelization in the 3-dimensional
case is efficient for up to a few thousand PEs. In particular, a
problem of size (64, 32768) (a size C. Peskin believes to be of
interest) can efficiently utilize an entire Ultracomputer consisting of
4096 PEs.


7. The pattern of the Ultracomputer activity
with network delay simulation.

Several additional runs of the program were simulated more
accurately by including delays in the network connecting the PEs with the
shared memory (SM). This was done using the NETSIM simulator [UCN28]
for a 6-stage network of 4×4 switches (see [UCN40]). This corresponds
to an Ultracomputer with 4096 PEs. In our simulations a given
fraction of the Ultracomputer (4, 8, and 16 PEs, respectively) was
executing our program while the rest of the Ultracomputer was idle, i.e.
did not produce additional network traffic.

Note that a request to the shared memory (SM) on an unloaded
network proceeds for 8 cycles, so that the delay is at least 7 cycles
more than in the simulations described above.

The overall additional delay of the simulation and various
memory-referencing statistics for runs on a problem of size N1 = 8,
N2 = 32 (#iter = 5) are presented in table 7.1 below.

The effective average delay per SM load (row 7 in table 7.1) was
computed as

#PEs × (t_d - t_0) / #SM_loads

where t_0 and t_d are the times of execution without (row 1) and with (row 2)
the network delay, respectively; #SM_loads is computed as the number of SM references


(row 4) multiplied by %SM_loads (row 5). This delay is less than the
network latency due to prefetching.

ROW | STATISTICS                           |  NUMBER OF PES EXECUTING THE PROGRAM
    |                                      |      4        |      8        |      16
----+--------------------------------------+---------------+---------------+---------------
 1  | time of execution without delay      | 434526        | 221885        | 125603
 2  | time of execution with delay         | 581570 (+34%) | 296630 (+34%) | 172797 (+38%)
 3  | # of data memory references          | 596480        | 589448        | 671314
 4  | # of SM references                   | 134388        | 142639        | 180101
    |                                      | (23% of all)  | (24% of all)  | (27% of all)
 5  | % of all SM loads/stores/FAAs        | 82%/15%/3%    | 82%/15%/3%    | 84%/13%/3%
 6  | average delay per load/store/FAA     | 9.54/9.55/10.4| 8.97/8.96/9.41| 8.76/8.77/9.03
 7  | effective average delay per SM load  | 6.36          | 6.14          | 6.17
 8  | % of all local loads/stores          | 70%/30%       | 70%/30%       | 72%/28%

Table 7.1
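The row-7 computation above can be sketched as a small helper; the total extra cycles accumulated by all PEs are spread over the SM loads. The numbers in the example call are hypothetical, not the table 7.1 measurements.

```python
def effective_delay_per_load(n_pes: int, t_without: float, t_with: float,
                             sm_refs: int, load_fraction: float) -> float:
    """Row 7 of table 7.1: #PEs * (t_d - t_0) / #SM_loads.

    t_without, t_with - execution times without and with network delay
    (rows 1 and 2); #SM_loads is the SM reference count (row 4) times
    the load fraction (row 5).
    """
    sm_loads = sm_refs * load_fraction
    return n_pes * (t_with - t_without) / sm_loads

# Hypothetical run: 4 PEs, 30000 extra cycles of wall time, 50000 SM
# references of which 80% are loads.
delay = effective_delay_per_load(4, 100000.0, 130000.0, 50000, 0.8)
```

Because instructions behind an outstanding load can proceed (prefetching), this effective per-load figure comes out below the raw network latency, as row 7 versus row 6 shows.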

The delay through the network is another factor decreasing the
performance of the parallel algorithm. According to table 7.1 this
delay is not excessive. It seemingly does not increase when the degree of
parallelization P increases, as long as P remains less than N1.


It is likely that idle waiting creates more network traffic than
normal processing. This is confirmed by the following observations:

1) For P > N1, idle waiting intensifies the stream of idle SM
references inside SYNCH loop 10 (see fig. 3.1), whose partial delay must
be greater than the average;

2) The number of SM references noticeably increases for the case of 16
PEs, while the fraction of SM loads also increases. Note that one
execution of loop 10 of the SYNCH routine corresponds to one SM load;

3) The delay increases when the number of PEs increases from 8 (delay
34%) to 16 (delay 38%).

Note that if the operating system uses idle PEs for other jobs, the
overall delay might decrease.

8. Conclusion.

The algorithms of the considered class can be easily adapted for
parallel processing on the Ultracomputer, using the task-spawning mechanism
for loop parallelization. The various simulation results presented show
high efficiency of the execution with hundreds of PEs in the 2-D case
and with thousands of PEs in the 3-D case. The degree of
parallelization of successive task spawnings varies, and the problem
execution generates numerous busy waits. Therefore the throughput of
the machine will be improved, without lowering the speed-up of the main
job, if several small low-priority jobs run concurrently with it.


Acknowledgement.

The author thanks Gabi Leshera for essential programming help in
writing and debugging the subroutines for minimizing the tension
function. Gabi also noted that the subroutine SYNCH (originally suggested
by Malvin Kalos and the author and analyzed in [UCN33]) can be adapted
for implementing the SPAWN-WAIT mechanism with initialization of spawn
parameters.

The proofreading and many technical comments by Allan Gottlieb,
Malvin H. Kalos, and Charles S. Peskin greatly improved the quality of
the exposition.


REFERENCES

Pes. C. S. Peskin. "Numerical Analysis of Blood Flow in the Heart",
Journ. Comput. Physics, Vol. 25, pp. 220-252, 1977.

UCN12. Allan Gottlieb. "Washcloth - The Logical Successor to Soapsuds",
Ultracomputer Note No. 12, Courant Institute, NYU, October 1980.

UCN13. Allan Gottlieb. "MOP - A (Minimal) Multiprocessor Operating System
Extending Washcloth", Ultracomputer Note No. 13, Courant Institute, NYU,
November 1980.

UCN16. Allan Gottlieb, B. D. Lubachevsky, and Larry Rudolph. "Basic
Techniques for the Efficient Coordination of Very Large Numbers of
Cooperating Sequential Processors", Ultracomputer Note No. 16, Courant
Institute, NYU, December 1980; to appear in ACM TOPLAS.

UCN18. Charles S. Peskin and Olof B. Widlund. "Remarks on Efficient
Numerical Algorithms for Ultracomputers", Ultracomputer Note No. 18,
Courant Institute, NYU, January 1981.

UCN19. Charles S. Peskin. "Ultracomputer Implementation of Odd-Even
Reduction", Ultracomputer Note No. 19, Courant Institute, NYU, January
1981.

UCN20. Charles S. Peskin. "A Comparison of Ultracomputer Architecture and
Lattice Architecture for Problems on Lattices", Ultracomputer Note
No. 20, Courant Institute, NYU, January 1981.

UCN23. David Korn. "Scientific Code Conversion", Ultracomputer Note No. 23,
Courant Institute, NYU, January 1981.

UCN24. David Korn. "Timing Analysis for Codes Run Under the Washcloth
Simulator", Ultracomputer Note No. 24, Courant Institute, NYU, March
1981.

UCN28. Marc Snir. "Netsim - Network Simulator for the Ultracomputer",
Ultracomputer Note No. 28, Courant Institute, NYU, May 1981.

UCN33. B. D. Lubachevsky. "Verification of Several Parallel Coordination
Programs Based on Description of their Reachability Sets",
Ultracomputer Note No. 33, Courant Institute, NYU, July 1981.

UCN40. Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal,
Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. "The NYU
Ultracomputer - Designing an MIMD Shared Memory Parallel Computer",
Ultracomputer Note No. 40, Courant Institute, NYU.

nia.x

are also shown. In table 5.1 values of T, K, and C* are given in units

of thousands of simulated CDC-6600 cycles (for example, according to

table 5.1 the maximal execution complexity of a problem with

^'According to C. Peskin, at least two membrane points must be placed in

each mesh Ax^Ax square whose opposite sides are "pierced" by the

membrane. Since N2 and N^ are powers of 2 the coefficient r in the

equality N2 = rxN^ must be a power of 2. Considering a circular

membrane of diameter Ji as a model we can easily see that r = 4 is the

minimal value of r satisfying these limitations. On the other hand for

a given Nj^ an excessively large r (and hence N2) leads to excessive

computations on the membrane without increase of accuracy. Thus r = 4

is optimal for the circle. In the general case, the optimal choice of

r depends on the membrane geometry.

-23-

(N^, N2) = (4,16) was 3109 thousand cycles). The "error %" gives the

maximum over the 32 runs of the relative prediction error

|C(P)-C*(P) |/C*(P), the "error abs" is similarly the maximum of the

absolute prediction errors 1 (C(P)-C*(P) | . (Note that "error %" may be

greater than _ ^ since the maximums need not occur for the

same case)

max

PROBLEM SIZE

Ni,N2(#iter)

Ti 1 T2

1

T3

K

C (1)

error i

% 1 absl

C*(l)

C*

max

4, 16 (6)

7.6149.6

1.6

9.2

1.1%

.18%!

2.61

838.

2176.

8, 32 (5)

17.3(48.0

3.4

8.5

0.5%

.19%!

4.31

1701.

3882.

16, 64 (5)

4, 8 (5)

39.3153.1

7.4138.0

8.3

1.3

9.4

7.5

0.2%

2.2%

.16%!

.12%!

6.9|

.51

4111.

345.

6700.

=========

1734.

8. 16 (5)

17.4142.9

3.5

8.1

1.0%

.23%!

2.11

850.

2296.

16, 32 (5)

39.3|47.9

8.5

8.5

0.4%

.20%!

4.71

2246.

4712.

8, 8 (5)

55.0

3.3

7.5

1.6%

.12%!

.61

464.

2106.

16, 16 (5)

82.0

8.3

8.2

0.6%

.18%!

2.71

1395.

3152.

Table 5.1

The following conclusions were derived by examining the main series

experiments :

a) The model (5.3) allows the prediction of parallel complexity C(P)

very precisely once the serial complexities T's and K are known.

b) Generally all T's and K increase when the problem size increases

-24-

(."^2 fo'^ the (8,32) problem is slightly less than that for

(4,16) problem due to the smaller //iter).

c) Korn's coefficient K may increases with the problem size, however

the ratio K/C^(l) steadily decreases.

5.3. Additional series of runs . As we will see in section 6 the

conclusion c) is especially important for obtaining the efficiency for

large-scale parallel computations. To increase our confidence in

statement c) two additional series of runs with N^ = 2 Ni and No = Ni,

respectively, were performed. By virtue of a) these additional series

were run for fewer number of ultracomputer sizes. Only 11 values of P

were picked, P = 1, 2, 3, 4, 6, 7, 8, 16, 24, 31, 32. Note that in the

series with N^ = Nj^ the linear equations whose coefficients are being

identified by a least square approximation do not allow the

determination of T^^ and T2 separately since the corresponding variables

have the same coefficient ^p(Ni) = Hp(N2) in each of the equations.

For this series the estimations of T, + To were computed.

The results of these additional series are also presented in

table 5.1 (in the second and the third sections which are separated by

double-dashed lines). These series confirm staement c) above.

5.4. Serial complexities . We next analysed how //iter, Korn's

coefficient, and the serial complexities depend on the problem

parameters.

-25-

The value of #iter is a function of all the parameters of the

problem, not only of problem size. Since this dependence is not known,

we would like a high efficiency for any value of //iter. In fact this

is the case; see section 6. The Newton's minimization algorithm was

set for at least two iterations, so #iter may vary from 2 to + o.

Trying to build a complexity model for Korn's coefficient, we

analyse all sources of complexity depending on P, the number of PEs .

The largest portion of this overhead, due to a mismatch between P and

the multiplicity of task spawning, need not be considered here since it

has already been included in the terms Hp(Mj^) x Tj^ of (5.2). Other

sources of the overhead are:

The REQUEST routine in (4.1) must be executed by each PE after all

currently spawned tasks have been assigned. The number of instances of

task spawnings (4.11) is constant for the part of the program dealing

with the mesh. Since task spawnings are embedded into the Newton

minimization loop which is executed #iter times, the number of

instances of task spawnings (4.11) is proportional to

(5.4) A + (B + C X lgN2) X //iter

for the part dealing with membrane. Therefore the overall overhead per

PE due to these factors can be expressed as (5.4).

Line 10 in SYNCH code^ has to be executed by all but one PE

''FORTRAN FTN compiler of the CDC-6600 considers line 10 as a mistake

(which it surely would be if the code was serial). The "parallel"

FORTRAN compiler should allow line 10. In the existing code "line 10"

are in fact two lines, using no-op CONTINUE statement.

-26-

subsequenC to #FREE_PES (INDEX) becoming 0; the overhead per PE can be

expressed in the form (5.4).

To make our prediction of K safer we include a term proportional

to N^ X //iter to the expression of K. (Obtaining a small coefficient

for this term would indicate that we did not raiss components of K

growing much faster than lgN2 x //iter.) Thus we come to the following

prediction formula

(5.5) K = k.Q + (ki + k2 X lgN2 + k3 x N2) x //iter

The four coefficients k^ in (5.5) were identified given 8

experimental values of K from table 5.1 using a least square method as

above. The results are presented in table 5.2, when the error columns

have the same meaning as in table 5.1.

^0 I ^1 I ^2 I k3 I error

I I I I % I abs

2.171.80 1.09 1.002 |2.3%| .19

Table 5.2

The experimental k^ is small enough. (Since one machine cycle

corresponds to the addition of .001 to a coefficient, this k_

corresponds to only two cycles.) This confirms our analysis of

complexity of K.

-27-

The analysis of the dependency of T^^ T2 and T3 on problem

parameters could be done by actually counting instructions in the

compiled code. This method however was rejected because of the large

length of the program (the compiled serial code occupies more than 6000

words, each containing up to 4 instructions). We tried to apply the

linear identification method to express T's as functions of the program

size. By analyzing table 3.1 one sees that the following expressions

might be candidates for identification their coefficients:

(5.6.1) Tj^ = agi + a-ii Nj^ + a2ixN]^lgN j^ -

(5.6.2) T2 = aQ2 + ai2'^*it^^ "*â– a22'*^''iterxlgN2

(5.6.3) T3 = ao3 + aiS^N^ + a23xNixlgNi

Formulas (5.6) with their left-hand sides replaced by experimental

values obtained from table 5.1 allow us to generate a number of linear

equations with unknown a^^^.. (To use experimental values of T^ + T2 in

the series N^ = N2 one has to add (5.6.1) to (5.6.2).) 14 equations

were solved to identify 6 coefficients a. . in (5.6.1) and (5.6.2) and

(separately) 8 equations were solved to identify 3 coefficients a. - in

(5.6.3). The results of these identifications are presented in

table 5.3.

i

1 ^Oi

1 aii

1

1

a2i

1

1

error

% 1 abs

1

1 .34

1 1.29

1

.29

1

â€” 1 \

1

.94% 1 .16

1

2

1 9.30

1 2.65

1

1.02

1 ;

1

3

1 .61

1 -.06(

)l

.14

1

11.% 1 .17

Table 5.3

-28-

This table indicates good accuracy for the coefficients of (5.6.1) and (5.6.2) and low accuracy for those of (5.6.3), which means that models (5.6.1) and (5.6.2) are satisfactory but model (5.6.3) is rather coarse. (Coefficient a_13 even turns out negative!) This does not contradict table 3.1, which gives only the order of complexity for each task. Since table 5.1 shows that T_3 is small, the term involving T_3 in (5.3) is also small, and we have not refined this model.


6. The estimation of efficiency.

The aim of this section is to study how the efficiency of the parallel algorithm depends on the number P of PEs and on the problem size. The target is to predict the efficiency for realistic-size two- and three-dimensional problems that are too large for us to simulate.

6.1. First let us recall the commonly accepted definitions of efficiency. The ideal efficiency E_P^ideal of an algorithm executed by P PEs is given by

(6.1)    E_P^ideal = T_s^best / (P × T_p(P)),

where T_s^best is the time required by the best possible serial algorithm, and T_p(P) is the time spent by the parallel algorithm with P PEs.

Since the best serial algorithm is usually not known, T_s^best in (6.1) is replaced by T_s^good, corresponding to the serial algorithm accepted for the problem:

(6.2)    E_P^real = T_s^good / (P × T_p(P)).

Note that the initial serial algorithm was accepted as "good" in our problem. After omitting the superscripts "real" and "good", (6.2) may be rewritten as

(6.3)    E_P = (T_s / T_p(1)) × (T_p(1) / (P × T_p(P))).

The ratio e_0 = T_s / T_p(1) measures the loss due to modifying the serial code to make it parallel. The ratio e_P = T_p(1) / (P × T_p(P)) measures the loss due to the execution of the parallel code by P processor elements. Our aim now is to evaluate e_0 and e_P for various values of P.
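The decomposition (6.3) can be written as a small helper; the timing values in the example below are illustrative, not measured.

```python
# Efficiency decomposition (6.3): E_P = e_0 * e_P, where
# e_0 = T_s / T_p(1) and e_P = T_p(1) / (P * T_p(P)).

def efficiency(Ts: float, Tp1: float, TpP: float, P: int):
    """Return (e_0, e_P, E_P) for serial time Ts, parallel code run
    on one PE (Tp1), and parallel code run on P PEs (TpP)."""
    e0 = Ts / Tp1          # loss from converting serial code to parallel
    eP = Tp1 / (P * TpP)   # loss from executing on P PEs
    return e0, eP, e0 * eP

# hypothetical timings, in the same units throughout
e0, eP, E = efficiency(Ts=900.0, Tp1=960.0, TpP=130.0, P=8)
```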

6.2. Evaluation of e_0. Table 6.1 contains empirically obtained values of e_0 for various problem sizes. The values of T_s used for calculating e_0 were obtained by running the serial code in a special tracing mode provided by the simulator [UCN12].

 N_1, N_2 (#iter) | 4, 16 (6) | 8, 32 (5) | 16, 64 (5)
 e_0              |   .925    |   .934    |   .943

 N_1, N_2 (#iter) | 4, 8 (5)  | 8, 16 (5) | 16, 32 (5)
 e_0              |   .915    |   .933    |   .948

 N_1, N_2 (#iter) | 8, 8 (5)  | 16, 16 (5)
 e_0              |   .933    |   .956

                      Table 6.1

Table 6.1 shows that e_0 steadily increases toward one when both parameters N_1 and N_2 characterizing the problem size increase with N_2/N_1 fixed. Since the values are not far from one, we conclude that converting the serial code into parallel code causes low overhead.


One can also observe the following irregularity in table 6.1: for N_1 = 4 and 8 the value of e_0 increases when N_2 increases, whereas for N_1 = 16 e_0 decreases when N_2 increases.

This effect could be explained by expressing e_0 as the weighted sum e_0 = λ_1 × e_0^1 + λ_2 × e_0^2, where λ_1 and λ_2 are the weights of the execution complexities of the subcodes corresponding to the mesh and the membrane, respectively, λ_1 + λ_2 = 1, and e_0^1 and e_0^2 are the efficiencies of the respective subcodes. These partial e_0's and λ's depend on the problem size. Both partial e_0's generally increase with the size of the problem. However, the speed of this increase may be different for different sizes.

There would be no contradiction with table 6.1 if, for the interval of N_2 given in that table, the inequality e_0^2 > e_0^1 held for smaller N_1 and the opposite inequality held for larger N_1. Then an increase in N_2 with N_1 fixed would cause an increase in λ_2, and this would also increase the total e_0 for small N_1 and decrease the total e_0 for large N_1.
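A toy numeric check of this argument follows. The partial efficiencies e1 (mesh) and e2 (membrane) and the weights are all hypothetical; the point is only the direction of the change in the weighted sum as the membrane weight grows.

```python
# e_0 = (1 - l2)*e1 + l2*e2, with l2 the membrane weight.

def total_e0(l2: float, e1: float, e2: float) -> float:
    return (1 - l2) * e1 + l2 * e2

# small N_1: membrane subcode assumed more efficient (e2 > e1),
# so a larger membrane weight raises the total
small_N1 = total_e0(0.4, 0.90, 0.95) < total_e0(0.6, 0.90, 0.95)

# large N_1: membrane subcode assumed less efficient (e2 < e1),
# so a larger membrane weight lowers the total
large_N1 = total_e0(0.4, 0.96, 0.93) > total_e0(0.6, 0.96, 0.93)
```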

6.3. Efficiencies for small 2-D problems. Table 6.2 contains empirical and predicted efficiencies e_P, P ≥ 2, for the three smallest regular (i.e. N_2 = 4×N_1) sizes of the 2-D problem. The prediction was based on the formula

(6.4)    e_P = C(1) / C(P),

where C(P) and C(1) are obtained from (5.3) with coefficients from table 5.1. The predicted values are given in parentheses. The agreement is seen to be good.

                 PROBLEM SIZE N_1, N_2 (#iter)
   P  |   4, 16 (6)    |   8, 32 (5)    |   16, 64 (5)
 -----+----------------+----------------+----------------
   2  | .9867 (.9872)  | .9927 (.9930)  | .9955 (.9957)
   3  | .8629 (.8632)  | .9520 (.9520)  | .9527 (.9528)
   4  | .9644 (.9663)  | .9789 (.9793)  | .9870 (.9872)
   7  | .7156 (.7159)  | .8475 (.8476)  | .8713 (.8702)
   8  | .8888 (.8906)  | .9585 (.9605)  | .9707 (.9707)
  15  | .4756 (.4750)  | .6541 (.6542)  | .7579 (.7576)
  16  | .7691 (.7700)  | .8466 (.8484)  | .9523 (.9538)
  31  | .3974 (.3974)  | .4383 (.4379)  | .6135 (.6131)
  32  | .3850 (.3850)  | .6866 (.6878)  | .7864 (.7871)
  48  |       (.2566)  |       (.4584)  |       (.5248)
  64  |       (.1925)  |       (.3438)  |       (.5833)

                          Table 6.2

6.4. Estimating efficiency for large size problems cannot be done by using formulas (5.3) and (6.4) directly, since #iter is not known. If one substitutes (5.5), (5.6) into (5.3) then the efficiency appears as a rational function E(#iter) of the form

    E(#iter) = (A × #iter + B) / (C × #iter + D),

where the coefficients A, B, C, D depend on the problem size parameters and on P. Since #iter may vary from 2 to +∞ and A, B, C, D > 0, the (asymptotic) minimum of E(#iter) is E(+∞) = A/C if A×D − B×C < 0, and is E(2) = (2×A + B)/(2×C + D) otherwise. This minimum is what we will call E_P^o, the optimistic estimate of e_P.
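The minimum just described can be sketched directly, assuming only the rational form of E(#iter) with positive coefficients; the coefficient values in any call are illustrative, since in the paper they come from (5.5), (5.6) and (5.3).

```python
# Optimistic estimate: minimum of E(it) = (A*it + B)/(C*it + D)
# over it in [2, +inf), for positive A, B, C, D.

def optimistic_estimate(A: float, B: float, C: float, D: float) -> float:
    # E is monotone; it decreases exactly when A*D - B*C < 0,
    # in which case the infimum is the asymptote A/C
    if A * D - B * C < 0:
        return A / C
    # otherwise E is nondecreasing and the minimum is at it = 2
    return (2 * A + B) / (2 * C + D)
```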

Since the prediction for the serial complexity T_s is rather crude, we now develop a more conservative estimate E_P^p, which we call the pessimistic estimate. Note that the parallel complexity model (5.3) is considered accurate in both policies.

Replacing C(P) and C(1) in (6.4) by their expressions from (5.2) and carrying out the obvious algebraic transformation one can obtain the following expression for e_P:

(6.5)    e_P = Σ_{i=1..L} λ_i × e_P^i + λ_0 × (1/P),

where the weights λ_i, i = 1,...,L, are given by

(6.6)    λ_i = Ξ_P(M_i) × T_i / (Ξ_P(M_1) × T_1 + ... + Ξ_P(M_L) × T_L + P × K)

and the partial efficiencies e_P^i are given by

(6.7)    e_P^i = M_i / Ξ_P(M_i).

Formula (6.6) is valid for i = 0, if one takes K as T_0 and P as Ξ_P(M_0).

After discarding the term λ_0 × (1/P) in (6.5) one has

(6.8)    e_P ≥ Σ_{i=1..L} λ_i × e_P^i = (Σ_{i=1..L} λ_i) × (Σ_{i=1..L} μ_i × e_P^i),

where the new weights μ_i are given by μ_i = λ_i / (λ_1 + ... + λ_L), i.e.

(6.9)    μ_i = Ξ_P(M_i) × T_i / (Ξ_P(M_1) × T_1 + ... + Ξ_P(M_L) × T_L).

Note that

(6.10)   Σ_{i=1..L} λ_i = (C(P) − P × K) / C(P) = 1 − P × K/C(P) ≥ 1 − P × K/C(1).

Substituting the expression for Σ_{i=1..L} λ_i from (6.10) into (6.8), one gets the following lower bound on e_P for large size problems:

(6.11)   e_P ≥ (1 − P × K/C(1)) × (Σ_{i=1..L} μ_i × e_P^i),

where the μ_i's are given by (6.9) and the e_P^i's are given by (6.7).

In the pessimistic policy we estimate the ratio K/C(1) using the minimal of its observed values in the small size experiments (see section 5), i.e. by 0.2%, and in (6.11) the "μ_i-weighted" sum is replaced by the minimum of e_P^i over all i = 1,...,L. Hence

(6.12)   E_P^p = (1 − P × 0.002) × min(e_P^1, e_P^2, e_P^3),

where e_P^1, e_P^2 and e_P^3 are the partial efficiencies corresponding to T_1, T_2 and T_3, respectively.
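A minimal sketch of the pessimistic estimate (6.12) follows. It assumes Ξ_P(M) = P × ⌈M/P⌉ (the cycles charged to a task of multiplicity M executed by P PEs), which makes the partial efficiency (6.7) equal to M/(P × ⌈M/P⌉); the task multiplicities in the example are illustrative.

```python
from math import ceil

def partial_eff(M: int, P: int) -> float:
    # (6.7) under the stated assumption: fraction of the charged
    # P*ceil(M/P) cycles that do useful work
    return M / (P * ceil(M / P))

def pessimistic(P: int, mults) -> float:
    # (6.12): E_P^p = (1 - P*0.002) * min over tasks of e_P^i,
    # with 0.002 the observed bound on K/C(1) from section 5
    return (1 - P * 0.002) * min(partial_eff(M, P) for M in mults)

e = pessimistic(4, [8, 5])  # hypothetical multiplicities
```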


Entries in table 6.3 below are E(5)/E_P^o/E_P^p. The symbol "--" replaces a negative E_P^p.

                          PROBLEM SIZE N_1, N_2
    P |    32, 128     |    64, 256     |    128, 512    |   256, 1024
 -----+----------------+----------------+----------------+----------------
    2 | .997/.996/.941 | .998/.997/.967 | .999/.998/.981 | .999/.999/.988
    4 | .991/.989/.843 | .994/.993/.909 | .996/.995/.948 | .998/.997/.969
    8 | .98/.97/.70    | .99/.98/.81    | .99/.99/.89    | .99/.99/.93
   16 | .96/.95/.51    | .97/.96/.67    | .98/.98/.77    | .99/.99/.87
   32 | .94/.94/.50    | .94/.93/.48    | .96/.95/.63    | .98/.97/.75
   64 | .71/.62/.23    | .93/.92/.45    | .92/.91/.44    | .95/.95/.59
  128 | .47/.38/.10    | .62/.55/.19    | .91/.91/.38    | .91/.90/.37
  256 | .23/.19/.03    | .37/.31/.06    | .55/.50/.12    | .90/.90/.25
  512 | .12/.09/--     | .19/.15/--     | .30/.27/--     | .50/.47/--
 1024 | .06/.05/--     | .09/.08/--     | .15/.13/--     | .26/.24/--

                              Table 6.3

Data in table 6.3 show that parallelization is efficient in the 2-dimensional case for up to a few hundred PEs.

6.5. Efficiency prediction for 3-D problems. In the 3-D case, the membrane is 2-dimensional and many different structures are possible. We suppose, however, that the membrane is reinforced by a collection of one-dimensional fibers and that the minimization problem (2.3)-(2.5) is to be solved independently on each fiber in this collection. Such a collection can be treated as a single large fiber. (This may not be the most efficient method, but it is the one we can analyse without further programming.)

Let the total number of membrane points be N_2 = 8 × N_1^2. This relation may be considered as optimal in the considered case (cf. the footnote on p. 22). We will assign the same amount of work to each task, but replace the multiplicities N_1 by N_1^2 for tasks dealing with the mesh. (The multiplicity N_1/2 + 1 of a non-standard FFT task must be replaced by N_1 × (N_1/2 + 1).) The orders of complexity of the tasks remain the same as in table 3.1. Whereas the numbers of invocations required, and thus the values of the coefficients a_ij in (5.7), may change, we will not try to predict the a_ij and k_i in the 3-D case. Instead we use the 2-D values. Table 6.4 presents the E_P^o predictions obtained.

Note that E_P^o increases when the a_ij increase. (This may be proven analytically. Since the formula for computing E_P^o was implemented as a computer program, we preferred to check it experimentally. For the problem size N_1 = 256, N_2 = 524288 the values of the a_ij's were independently and uniformly varied in the range from the nominal value to 150% of it. All 1000 computed values of the trial E_P^o appeared to be greater than the nominal value of this quantity.)
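The randomized check described in the parenthesis can be sketched as follows. Here `estimate` is a stand-in for the program computing E_P^o (not reproduced in the paper), and the coefficient vector used in the example is illustrative; the check probes monotonicity in each coefficient.

```python
import random

def check_monotone(estimate, nominal, trials=1000, seed=0):
    """Inflate each coefficient independently and uniformly in
    [nominal, 1.5*nominal]; return True if no trial yields a smaller
    estimate than the nominal one."""
    rng = random.Random(seed)
    base = estimate(nominal)
    for _ in range(trials):
        varied = [a * rng.uniform(1.0, 1.5) for a in nominal]
        if estimate(varied) < base:
            return False
    return True

# any function increasing in each coefficient passes the check,
# e.g. a -> sum(a) / (sum(a) + 1) with a hypothetical vector
ok = check_monotone(lambda a: sum(a) / (sum(a) + 1.0), [1.0, 2.0, 3.0])
```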


                    PROBLEM SIZE N_1, N_2
    P | 32, 8192 | 64, 32768 | 128, 131072 | 256, 524288
 -----+----------+-----------+-------------+-------------
    2 |  .9999   |   .9999   |    .9999    |    .9999
    4 |  .9996   |   .9997   |    .9997    |    .9997
    8 |  .9991   |   .9993   |    .9993    |    .9994
   16 |  .998    |   .998    |    .999     |    .999
   32 |  .996    |   .997    |    .997     |    .997
   64 |  .992    |   .993    |    .994     |    .995
  128 |  .984    |   .987    |    .988     |    .989
  256 |  .968    |   .974    |    .977     |    .979
  512 |  .937    |   .949    |    .955     |    .959
 1024 |  .882    |   .902    |    .913     |    .921
 2048 |  .692    |   .821    |    .840     |    .854
 4096 |  .471    |   .697    |    .724     |    .744
 8192 |  .288    |   .527    |    .568     |    .593

                        Table 6.4

Data in table 6.4 show that parallelization in the 3-dimensional case is efficient for up to a few thousand PEs. In particular, a problem of size (64, 32768) (a size C. Peskin believes to be of interest) can efficiently utilize an entire Ultracomputer consisting of 4096 PEs.


7. The pattern of the Ultracomputer activity with network delay simulation.

Several additional runs of the program were simulated more accurately by including delays in the network connecting the PEs with the shared memory (SM). This was done using the NETSIM simulator [UCN28] for a 6-stage network of 4×4 switches (see [UCN40]). This corresponds to the Ultracomputer with 4096 PEs. In our simulations a given fraction of the Ultracomputer (4, 8 and 16 PEs, respectively) was executing our program while the rest of the Ultracomputer was idle, i.e. did not produce additional network traffic.

Note that a request to the shared memory (SM) on an unloaded network proceeds for 8 cycles, so that the delay is at least 7 cycles more than in the simulations described above.

The overall additional delay of the simulation and various memory referencing statistics for runs on a problem of size N_1 = 8, N_2 = 32 (#iter = 5) are presented in table 7.1.

The effective average delay per SM load (row 7 in table 7.1) was computed as

    #PEs × (t_d − t_0) / #SM_loads,

where t_0 and t_d are the times of execution without (row 1) and with (row 2) network delay, respectively, and #SM_loads is computed as the number of SM references (row 4) multiplied by %SM_loads (row 5). This delay is less than the network latency due to prefetching.

 ROW | STATISTIC              | NUMBER OF PEs EXECUTING THE PROGRAM
     |                        |      4       |      8       |      16
-----+------------------------+--------------+--------------+--------------
  1  | time of execution      |    434526    |    221885    |    125603
     | without delay          |              |              |
  2  | time of execution      |    581570    |    296630    |    172797
     | with delay             |    (+34%)    |    (+34%)    |    (+38%)
  3  | # of data memory       |    596480    |    589448    |    671314
     | references             |              |              |
  4  | # of SM references     |    134388    |    142639    |    180101
     |                        | (23% of all) | (24% of all) | (27% of all)
  5  | % of all SM            |  82%/15%/3%  |  82%/15%/3%  |  84%/13%/3%
     | loads/stores/FAAs      |              |              |
  6  | average delay per      |  9.54/9.55/  |  8.97/8.96/  |  8.76/8.77/
     | load/store/FAA         |     10.4     |     9.41     |     9.03
  7  | effective average      |     6.36     |     6.14     |     6.17
     | delay per SM load      |              |              |
  8  | % of all local         |   70%/30%    |   70%/30%    |   72%/28%
     | loads/stores           |              |              |

                               Table 7.1

The delay through the network is another factor decreasing the performance of the parallel algorithm. According to table 7.1 this delay is not excessive. It seemingly does not increase when the degree of parallelization P increases, if P remains less than N_1.


It is likely that idle waiting creates more network traffic than normal processing. This is confirmed by the following observations:

1) For P > N_1 idle waiting intensifies the stream of idle SM references inside SYNCH loop 10 (see fig. 3.1), whose partial delay must be greater than the average;

2) The number of SM references noticeably increases for the case of 16 PEs, while the fraction of SM loads also increases. Note that one execution of loop 10 of the SYNCH routine corresponds to one SM load;

3) The delay increases when the number of PEs increases from 8 (delay 34%) to 16 (delay 38%).

Note that if the operating system uses idle PEs for other jobs the

overall delay might decrease.

8. Conclusion.

The algorithms of the considered class can be easily adapted for parallel processing on the Ultracomputer, using the task spawning mechanism for loop parallelization. The various simulation results presented show high efficiency of the execution with hundreds of PEs in the 2-D case and with thousands of PEs in the 3-D case. The degree of parallelization of successive task spawnings varies and the problem execution generates numerous busy waits. Therefore the throughput of the machine will be improved, without lowering the speed-up for the main job, if several small low-priority jobs are run concurrently with it.


Acknowledgement.

The author thanks Gabi Leshera for essential programming help in writing and debugging the subroutines for minimizing the tension function. Gabi also noted that subroutine SYNCH (originally suggested by Malvin Kalos and the author and analyzed in [UCN33]) can be adapted for implementing the SPAWN-WAIT mechanism so as to have initialization of spawn parameters.

The proofreading and many technical comments by Allan Gottlieb, Malvin H. Kalos and Charles S. Peskin greatly improved the quality of the exposition.


REFERENCES

Pes. C. S. Peskin. "Numerical Analysis of Blood Flow in the Heart", Journ. Comput. Physics, Vol. 25, pp. 220-252, 1977.

UCN12. Allan Gottlieb. "Washcloth - The Logical Successor to Soapsuds",

Ultracomputer Note No. 12, Courant Institute, NYU, October 1980.

UCN13. Allan Gottlieb. "MOP - A (Minimal) Multiprocessor Operating System

Extending Washcloth", Ultracomputer Note No. 13, Courant Institute, NYU,

November 1980.

UCN16. Allan Gottlieb, B. D. Lubachevsky, and Larry Rudolph. "Basic

Techniques for the Efficient Coordination of Very Large Numbers of

Cooperating Sequential Processors", Ultracomputer Note No. 16, Courant

Institute, NYU, December 1980; to appear in ACM TOPLAS.

UCN18. Charles S. Peskin and Olof B. Widlund. "Remarks on Efficient Numerical Algorithms for Ultracomputers", Ultracomputer Note No. 18, Courant Institute, NYU, January 1981.

UCN19. Charles S. Peskin. "Ultracomputer Implementation of Odd-Even

Reduction", Ultracomputer Note No. 19, Courant Institute, NYU, January

1981.

UCN20. Charles S. Peskin. "A Comparison of Ultracomputer Architecture and Lattice Architecture for Problems on Lattices", Ultracomputer Note No. 20, Courant Institute, NYU, January 1981.

UCN23. David Korn. "Scientific Code Conversion", Ultracomputer Note No. 23,

Courant Institute, NYU, January 1981.

UCN24. David Korn. "Timing Analysis for Codes Run Under the Washcloth

Simulator", Ultracomputer Note No. 24, Courant Institute, NYU, March

1981.

UCN28. Marc Snir. "Netsim - Network Simulator for the Ultracomputer",

Ultracomputer Note No. 28, Courant Institute, NYU, May 1981.

UCN33. B. D. Lubachevsky. "Verification of Several Parallel Coordination

Programs Based on Description of their Reachability Sets",

Ultracomputer Note No. 33, Courant Institute, NYU, July 1981.

UCN40. Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal,

Kevin P. McAuliffe, Larry Rudolph, and Marc Snir, "The NYU