B. D Lubachevsky.

Parallelizing an algorithm of Charles S. Peskin for describing incompressible flow of a fluid coupled to an elastic membrane online

. (page 2 of 3)
Font size minimal value C*(l) and the maximal value C*^^ of C*(P), P = 1,..,32,

nia.x

are also shown. In table 5.1 values of T, K, and C* are given in units
of thousands of simulated CDC-6600 cycles (for example, according to
table 5.1 the maximal execution complexity of a problem with

^'According to C. Peskin, at least two membrane points must be placed in
each mesh Ax^Ax square whose opposite sides are "pierced" by the
membrane. Since N2 and N^ are powers of 2 the coefficient r in the
equality N2 = rxN^ must be a power of 2. Considering a circular
membrane of diameter Ji as a model we can easily see that r = 4 is the
minimal value of r satisfying these limitations. On the other hand for
a given Nj^ an excessively large r (and hence N2) leads to excessive
computations on the membrane without increase of accuracy. Thus r = 4
is optimal for the circle. In the general case, the optimal choice of
r depends on the membrane geometry.

-23-

(N^, N2) = (4,16) was 3109 thousand cycles). The "error %" gives the
maximum over the 32 runs of the relative prediction error
|C(P)-C*(P) |/C*(P), the "error abs" is similarly the maximum of the
absolute prediction errors 1 (C(P)-C*(P) | . (Note that "error %" may be
greater than _ ^ since the maximums need not occur for the

same case)

max

PROBLEM SIZE
Ni,N2(#iter)

Ti 1 T2

1

T3

K

C (1)

error i
% 1 absl

C*(l)

C*
max

4, 16 (6)

7.6149.6

1.6

9.2

1.1%

.18%!

2.61

838.

2176.

8, 32 (5)

17.3(48.0

3.4

8.5

0.5%

.19%!

4.31

1701.

3882.

16, 64 (5)
4, 8 (5)

39.3153.1
7.4138.0

8.3
1.3

9.4
7.5

0.2%
2.2%

.16%!
.12%!

6.9|
.51

4111.
345.

6700.

=========

1734.

8. 16 (5)

17.4142.9

3.5

8.1

1.0%

.23%!

2.11

850.

2296.

16, 32 (5)

39.3|47.9

8.5

8.5

0.4%

.20%!

4.71

2246.

4712.

8, 8 (5)

55.0

3.3

7.5

1.6%

.12%!

.61

464.

2106.

16, 16 (5)

82.0

8.3

8.2

0.6%

.18%!

2.71

1395.

3152.

Table 5.1

The following conclusions were derived by examining the main series
experiments :

a) The model (5.3) allows the prediction of parallel complexity C(P)
very precisely once the serial complexities T's and K are known.

b) Generally all T's and K increase when the problem size increases

-24-

(."^2 fo'^ the (8,32) problem is slightly less than that for
(4,16) problem due to the smaller //iter).

c) Korn's coefficient K may increases with the problem size, however

5.3. Additional series of runs . As we will see in section 6 the
conclusion c) is especially important for obtaining the efficiency for
large-scale parallel computations. To increase our confidence in
statement c) two additional series of runs with N^ = 2 Ni and No = Ni,
respectively, were performed. By virtue of a) these additional series
were run for fewer number of ultracomputer sizes. Only 11 values of P
were picked, P = 1, 2, 3, 4, 6, 7, 8, 16, 24, 31, 32. Note that in the
series with N^ = Nj^ the linear equations whose coefficients are being
identified by a least square approximation do not allow the
determination of T^^ and T2 separately since the corresponding variables
have the same coefficient ^p(Ni) = Hp(N2) in each of the equations.
For this series the estimations of T, + To were computed.

The results of these additional series are also presented in
table 5.1 (in the second and the third sections which are separated by
double-dashed lines). These series confirm staement c) above.

5.4. Serial complexities . We next analysed how //iter, Korn's
coefficient, and the serial complexities depend on the problem
parameters.

-25-
The value of #iter is a function of all the parameters of the
problem, not only of problem size. Since this dependence is not known,
we would like a high efficiency for any value of //iter. In fact this
is the case; see section 6. The Newton's minimization algorithm was
set for at least two iterations, so #iter may vary from 2 to + o.

Trying to build a complexity model for Korn's coefficient, we
analyse all sources of complexity depending on P, the number of PEs .
The largest portion of this overhead, due to a mismatch between P and
the multiplicity of task spawning, need not be considered here since it
has already been included in the terms Hp(Mj^) x Tj^ of (5.2). Other

The REQUEST routine in (4.1) must be executed by each PE after all
currently spawned tasks have been assigned. The number of instances of
task spawnings (4.11) is constant for the part of the program dealing
with the mesh. Since task spawnings are embedded into the Newton
minimization loop which is executed #iter times, the number of
instances of task spawnings (4.11) is proportional to
(5.4) A + (B + C X lgN2) X //iter

for the part dealing with membrane. Therefore the overall overhead per
PE due to these factors can be expressed as (5.4).

Line 10 in SYNCH code^ has to be executed by all but one PE

''FORTRAN FTN compiler of the CDC-6600 considers line 10 as a mistake
(which it surely would be if the code was serial). The "parallel"
FORTRAN compiler should allow line 10. In the existing code "line 10"
are in fact two lines, using no-op CONTINUE statement.

-26-
subsequenC to #FREE_PES (INDEX) becoming 0; the overhead per PE can be
expressed in the form (5.4).

To make our prediction of K safer we include a term proportional

to N^ X //iter to the expression of K. (Obtaining a small coefficient
for this term would indicate that we did not raiss components of K
growing much faster than lgN2 x //iter.) Thus we come to the following
prediction formula

(5.5) K = k.Q + (ki + k2 X lgN2 + k3 x N2) x //iter

The four coefficients k^ in (5.5) were identified given 8
experimental values of K from table 5.1 using a least square method as
above. The results are presented in table 5.2, when the error columns
have the same meaning as in table 5.1.

^0 I ^1 I ^2 I k3 I error
I I I I % I abs

2.171.80 1.09 1.002 |2.3%| .19

Table 5.2

The experimental k^ is small enough. (Since one machine cycle
corresponds to the addition of .001 to a coefficient, this k_
corresponds to only two cycles.) This confirms our analysis of
complexity of K.

-27-
The analysis of the dependency of T^^ T2 and T3 on problem
parameters could be done by actually counting instructions in the
compiled code. This method however was rejected because of the large
length of the program (the compiled serial code occupies more than 6000
words, each containing up to 4 instructions). We tried to apply the
linear identification method to express T's as functions of the program
size. By analyzing table 3.1 one sees that the following expressions
might be candidates for identification their coefficients:

(5.6.1) Tj^ = agi + a-ii Nj^ + a2ixN]^lgN j^ -

(5.6.2) T2 = aQ2 + ai2'^*it^^ "*â–  a22'*^''iterxlgN2

(5.6.3) T3 = ao3 + aiS^N^ + a23xNixlgNi

Formulas (5.6) with their left-hand sides replaced by experimental
values obtained from table 5.1 allow us to generate a number of linear
equations with unknown a^^^.. (To use experimental values of T^ + T2 in
the series N^ = N2 one has to add (5.6.1) to (5.6.2).) 14 equations
were solved to identify 6 coefficients a. . in (5.6.1) and (5.6.2) and
(separately) 8 equations were solved to identify 3 coefficients a. - in
(5.6.3). The results of these identifications are presented in
table 5.3.

i

1 ^Oi

1 aii

1
1

a2i

1
1

error
% 1 abs

1

1 .34

1 1.29

1

.29

1
â€” 1 \

1
.94% 1 .16

1

2

1 9.30

1 2.65

1

1.02

1 ;
1

3

1 .61

1 -.06(

)l

.14

1

11.% 1 .17

Table 5.3

-28-
This table indicates good accuracy for a. , and a. o and low
accuracy for a.^^^, which means that models (5.6.1) and (5.6.2) are
satisfactory but model (5.6.3) is rather coarse. (Coefficient a,^ even
turns out negative!) This does not contradict table 3.1, which gives
only the order of complexity for each task. Since table 5.1 shows that

T3 is small, in (5.3) the term involving T3 is also small and we have
not refined this model.

-29-
6. The estimation of efficiency.

The aim of this section is to study how the efficiency of the
parallel algorithm depends on the number P of PEs and on the problem
size. The target is to predict efficiency for realistic size two and
three dimensional problems that are too large for us to simulate.

6.1. First let us recall the commonly accepted definitions of
efficiency. The ideal efficiency E^^^^-'- of an algorithm executed by P
PEs is given by

^best
(6.1) Eideal = s

P P T^TpT

where T^Â®^*^ is the time required by the best possible serial algorithm,
and T (p) is the time spent by the parallel algorithm with P PEs.

Since the best serial algorithm is usually not known, TÂ°^st ^^
(6.1) is replaced by Tgoo^, corresponding to the serial algorithm
accepted for the problem:

rpgood
(6.2) Egeal = s

P X Tp(p)

Note that the initial serial algorithm was accepted as "good" in our
problem. After omitting the superscript "real" and "good", (6.2) may
be rewritten as

(6.3) Eâ€ž =

Ts T (1)

X

P Tp(l) Tp(p-)

Df T
The ratio eâ€ž == __!_

-30-

measures the loss due to modifying the

serial code to make it parallel. Ratio ep ==

Df Tâ€ž(l)

P '^ Xp(P)

measures the

loss due to execution of the parallel code by P processor elements.
Our aim now is to evaluate e^ and ep for various values of P.

6.2. Evaluation of e^^. Table 6.1 contains empirically obtained
values of e^ for various problem sizes. The values of Tg used for
calculating e^ were obtained by running the serial code in a special
tracing mode provided by the simulator [UCN12].

N^, N2 (//iter)

4, 16

(6) 1

8, 32

(5)

16, 64

(5)

^0

.925

1

.934

.943

N^, N2 (#iter)

4, 8

(5) 1

8, 16

(5)

16, 32

(5)

^0

.915

I

.933

.948

N^, N2 (#iter)

1

8, 8

(5)

16, 16

(5)

!o

1 .933

.956

Table 6.1

Table 6.1 shows that eg steadily increases toward one when both
parameters Nj, and N2 characterizing the problem size increase, keeping
N2/N]^ fixed. Since the values are not far from one, we conclude that
converting of the serial code into parallel code causes low overhead.

-31-
One can also observe the following irregularity in table 6.1: for
N^ = 4 and 8 the value of eg increases when N2 increases, whereas for
N, = 16 eg decreases when N2 increases.

This effect could be explained by expressing Cq as the weighted

1 2
sura e^ = Aixeg + ^2^^0Â» where Xj and X2 ^^^ the weights of the

execution complexities of the subcodes corresponding to the mesh and

1 2
membrane, respectively, X. + X2 = 1, eg and eg are the efficiencies of

the respective subcodes. These partial 6g's and X's depend on the

problem size. Both partial eg's generally increase with the size of

the problem. However the speed of this increase may be different for

different sizes.

There would be no contradiction with table 6.1 if, for the

2 1
interval of N2 given in that table, the inequality eg > eg held for

smaller N. and the opposite inequality held for larger Nj^. Then an

2
increase in N2 with Nj^ fixed would cause an increase in Xg, and this

would also increase the total e^ for small Nj^ and decrease the total eg

for large N. .

6.3. Efficiencies for small 2-D problems . Table 6.2 contains
empirical and predicted efficiencies ep, P _> 2, for the three regular
(i.e. N2 = 4xNj^) smallest sizes of the 2-D problem. The prediction
was based on the formula

-32-
where C (P) and C (1) are obtained from (5.3) with coefficients from
table 5.1. The predicted values are given in parenthesis. The
agreement is seen to be good.

p

PROBLEM SIZE

Np N2

(//iter)

2

4, ]
.9867

L6 (6)
(.9872)

8, :

= ^^ = = = =

.9927

52 (5)
(.9930)

16,
.9955

64 (5)
(.9957)

3

.8629

(.8632)

.9520

(.9520)

.9527

(.9528)

4

.9644

(.9663)

.9789

(.9793)

.9870

(.9872)

7

.7156

(.7159)

.8475

(.8476)

.8713

(.8702)

8

.8888

(.8906)

.9585

(.9605)

.9707

(.9707)

15

.4756

(.4750)

.6541

(.6542)

.7579

(.7576)

16

.7691

(.7700)

.8466

(.8484)

.9523

(.9538)

31

.3974

(.3974)

.4383

(.4379)

.6135

(.6131)

32

.3850

(.3850)

.6866

(.6878)

.7864

(.7871)

48

(.2566)

(.4584)

(.5248)

64

(.1925)

(.3438)

(.5833)

Table 6.2

6.4. Estimating efficiency for large size problems can not be done
by using formulas (5.3) and (6.4) directly since #iter is not known.
If one substitutes (5.5), (5.6) into (5.3) then efficiency appears as a
rational function E(#iter) of the form

E(#iter) =

A X //iter + B
C X ^;iter + D

where the coefficients A, B, C, D depend on the problem size parameters

-33-
and on P. Since #iCer may vary from 2 to +00 and A, B, C, D > 0, the
(asymptotic) minimum of E(#iter) is E(+oo) = A/C, if AC - BD < 0, and
is E(2) = (2xA+B)/(2xC+D) , otherwise. This minimum is what we will
call Eg, the optimistic estimate of ep.

Since predictions for the serial complexity T^ is rather crude we

p
now develop a more conservative estimate Ep, which we call the

pessimistic estimate. Note that the parallel complexity model (5.3) is

considered accurate in both policies.

Replacing C (P) and C (1) in (6.4) by their expressions from (5.2)
and carrying out the obvious algebraic transformation one can obtain
the following expression for e^:

â– L ,â€˘ 1
(6.5) e = Z X X e^ + Xq X ,

where the weights X^, i = 1,,..,P, are given by

Hp(Mi) X Ti

(6.6) X.

P =p(Mi) X Ti +. ..+ Hp(Mp) X Tl + P X K
and the partial efficiencies ei are given by

(6.7) e^ = J_J_.

Formula (6.6) is valid for i = 0, if one takes K as Tq and P as Xp(MQ),
After discarding the term Xâ€ž x _ in (6.5) one has:

-3A-

L ,. L L

(6.8) e > z X. X ep = (2 >^0 (^ ^^-i >" ep)

^ i=l ^ i=l ^ 1=1 ^ ^

where new weights M^'s are given by \iÂ± = o^ else

E X.

Hp(Mi) X Ti
(6.9) M,

1 -p(Mi) X Ti +...+ =p(Mp) X Tl

Note that

(6.10) ix. = ' ^"\- "^^ > ^ ^^^:/^^ = i-p^^.

i=i 1 TTF) - TTTl cTTT

L
Substituting the expression for E X from (6.10) into (6.8), one gets

i=l -^

the following lower bound on ep for large size problems:

K

L

(6.11) ep > (1 - P X _Â±_) X (Z M. X e^)

where U^'s are given by (6.8) and ep's are given by (6.7).

In the pessimistic policy we estimate the ratio _^___ using the

C (1) ^

minimal of its observed values in the small size experiments (see
section 5), i.e. by 0.2% and in (6.11) the "vi .-weighted" sum is
replaced by the minimum of e^ over all i = 1,...,L. Hence

(6.12) Ep = (1 - P X 0.002) X (min (eâ„˘, e^, e^^ ) )

where e^, ep, ep are partial efficiencies corresponding T^^, T2 and T3,
respectively.

-35-

Entries in table 6.3 above are E(5)/EÂ°/Ep. Symbol " â€” " replaces
negative E^.

p

PROBLEM SIZE N^, N2

2

32, 128
.997/. 996/. 941

64, 256
.998/. 997/. 967

128, 512
.999/. 998/. 981

256, 1024

===============

.999/. 999/. 988

4

.991/. 989/. 843

.994/. 993/. 909

.996/. 995/. 948

.998/. 997/. 969

8

.98/. 97/. 70

.99/. 98/. 81

.99/. 99/. 89

.99/. 99/. 93

16

.96/. 95/. 51

.97/. 96/. 67

.98/. 98/. 77

.99/. 99/. 87

32

.94/. 94/. 50

.94/. 93/. 48

.96/. 95/. 63

.98/. 97/. 75

64

.71/. 62/. 23

.93/. 92/. 45

.92/. 91/. 44

.95/. 95/. 59

128

.47/. 38/. 10

.62/. 55/. 19

.91/. 91/. 38

.91/. 90/. 37

256

.23/. 19/. 03

.37/. 31/. 06

.55/. 50/. 12

.90/. 90/. 25

512

.12/.09/~

.19/.15/~

|.30/.27/~

.50/. 47/ â€”

1024

.06/.05/~

.09/.08/~

|.15/.13/~

I. 26/. 24/ -

Table 6.3

Data in table 6.3 show that parallelization is efficient in the
2-dimensional case for up to a few hundred PEs .

6.5. Efficiency prediction for 3-D problems . In the 3-D case, the
membrane is 2-dimensional and many different structures are possible.
We suppose, however, that the membrane is reinforced by a collection of
one-dimensional fibers and that the minimization problem (2.3)-(2.5) is
to be solved independently on each fiber in this collection. Such a
collection can be treated as a single large fiber. (This may not be

-36-
the most efficient method, but it is the one we can analyse without
further programming.)

Let the total number of membrane points be N2 = 8 X N]^~. This
relation may be considered as optimal in the considered case (cf.
footnote on p. 22). We will assign the same amount of work for each
task, but replace multiplicities N. by Ni for tasks dealing with mesh.
(The multiplicity ___ + 1 of a non-standard FFT task must be replaced by
N, X ( - - + 1).) Orders of complexities of tasks remain the same as in
table 3.1. Whereas the number of invocations required and thus the
values of coefficients a^. in (5.7) may change, we will not try to
predict a^. and k^^^ in the 3-D case. Instead we use the 2-D values.
Table 6.4 presents the EÂ° predictions obtained.

Note that Ep increases when a^. increase. (This may be proven
analytically. Since the formula for computing EÂ° was implemented as a
computer program, we prefered to check it experimentally. For the
problem size N^ = 256, N2 = 524288 the values of the a^js were
independently and uniformly varied in the range from the nominal value
to the 150% of it. All 1000 computed values of trial EÂ°â€žâ€ž, appears to
be greater than the nominal value of this quantity.)

-37-

p

PROBLEM

SIZE N^, N2

2

32, 8192
.9999

1 64

32768
.9999

128, 131072
.9999

256, 524288
.9999

4

.9996

.9997

.9997

.9997

8

.9991

.9993

.9993

.9994

16

.998

.998

.999

.999

32

.996

.997

.997

.997

64

.992

.993

.994

.995

128

.984

.987

.988

.989

256

.968

.974

.977

.979

512

.937

.949

.955

.959

1024

.882

.902

.913

.921

2048

.692

.821

.840

.854

4096

.471

.697

.724

.744

8192

.288

.527

.568

.593

Table 6.4

Data in table 6.3 shows that parallelization in the 3-dimensional
case is efficient for up to a few thousand of PEs . In particular, a
problem of size (64,32768) (a size C. Peskin believes to be of
interest) can efficiently utilize an entire Ultracomputer consisting of
4096 PEs.

-38-

7, The pattern of the Ultracomputer activity
with network delay simulation.

Several additional runs of the program were simulated more
accurately by including delays in the network connecting PEs with
shared memory (SM). This was done using the NETSIM simulator [UCN28]
for a 6-stage network of 4 4 switches (see [UCN40]). This corresponds
to the Ultracomputer with 4096 PEs. In our simulations a given
fraction of the Ultracomputer (4, 8 and 16 PEs, respectively) was
executing our program while the rest of the Ultracomuter was idle, i.e.
did not produce additional network traffic.

Note that a request to the shared memory (SM) on an unloaded
nerwork proceeds for 8 cycles, so that the delay is at least 7 more
cycles than in the simulations described above.

The overall additional delay of the simulation and various memory
referencing statistics for runs on a problem of sizes N, =8, No = 32
(//iter = 5) are represented in following table 7.1.

The effective average delay per SM load (row 7 in table 7.1) was

computed as

#PEs X (tj - tâ€ž)

where t^, t^j are times of execution without (row 1) and with (row 2)
network delay, respectively; //SM loads is computed as #SM references

-39-

(row 4) multiplied by %SM_loads (row 5). This delay is less than the
network latency due to prefetching.

//ROW

STATISTICS

time of execution
without delay

NUMBER OF PES EXECUTING

THE PROGRAM

1

4
434526

8
221885

16
125603

2

time of execution
with delay

581570
(+34%)

296630
(+34%)

172797
(+38%)

3

# of data memory
references

596480

589448

671314

4

^^ of SM references

134388
(23% of all)

142639
(24% of all)

180101
(27% of all)

5

% of all of SM

82%/15%/3%

82%/15%/3%

84%/13%/3%

6

average delay per

9.54/9.55/10.4

8.97/8.96/9.41

8.76/8.77/9.03

7

effective average

6.36

6.14

6.17

8

% of all of local

70%/30%

70%/30%

7 2%/ 2 8%

Table 7.1

The delay through the network is another factor decreasing
performance of the parallel algorithm. According to table 7.1 this
delay is not excessive. It seemingly does not increase when degree of
parallelization P increases if P remains less than N. .

-40-

It is likely that idle waiting creates more network traffic than
normal processing. This is confirmed by:

1) For P > N^ idle waiting intensifies the stream of idle SM
references inside SYNCH loop 10 (see fig. 3.1) whose partial delay must
be greater than the average;

2) The number of SM references noticably increases for case of 16
PEs; while the fraction of SM loads also increses. Note that one
execution of loop 10 at SYNCH routine corresponds to one SM load;

3) The delay increses when number of PEs increases from 8 (delay
34%) to 16 (delay 38%).

Note that if the operating system uses idle PEs for other jobs the
overall delay might decrease.

8. Conclusion.

The algorithms of the considered class can be easily adapted for
parallel processing on the Ultracomputer , using task spawning mechanism
for loop parallelization. Various simulation results presented show
high efficiency of the execution with hundreds of PEs in the 2-D case
and with thousands of PEs in the 3-D case. The degree of
parallelization of successive task spawnings varies and the problem
execution generates numerous busy waits. Therefore the throughput of
the machine will be improved without lowering the speed-up for the main
job, if several small low priority jobs are running concurrently with
it.

-41-
Acknowledgement .

The author thanks Gabi Leshera for an essential programming help in
writing and debugging the subroutines for minimizing the tension
function. Gabi also noted that subroutine SYNCH (originally suggested
by Malvin Kalos and the author and analyzed in [UCN33]) can be adapted
for implementing SPAWN-WAIT mechanism to have initialization of spawn
parameters.

Malvin H. Kalos and Charles S. Peskin greatly improved the quality of
the exposition.

-42-
REFERRENCES

Pes. C. S. Peskin. "Numerical Analysis of Blood Flow in the Heart",
Journ. Coraput. Physics, Vol.25, pp. 220-252, 1977.

UCN12. Allan Gottlieb. "Washcloth - The Logical Successor to Soapsuds",
Ultracomputer Note No. 12, Courant Institute, NYU, October 1980.

UCN13. Allan Gottlieb. "MOP - A (Minimal) Multiprocessor Operating System
Extending Washcloth", Ultracomputer Note No. 13, Courant Institute, NYU,
November 1980.

UCN16. Allan Gottlieb, B. D. Lubachevsky, and Larry Rudolph. "Basic
Techniques for the Efficient Coordination of Very Large Numbers of
Cooperating Sequential Processors", Ultracomputer Note No. 16, Courant
Institute, NYU, December 1980; to appear in ACM TOPLAS.

UCN18. Charles S. Peskin and Olof B. Widlund. "Remarks on Efficient
Numerical Algorithms for Ultracomputers" , Ultracomputer Note No. 18,
Courant Institute, NYU, January 1981.

UCN19. Charles S. Peskin. "Ultracomputer Implementation of Odd-Even
Reduction", Ultracomputer Note No. 19, Courant Institute, NYU, January
1981.

UCN20. Charles S. Peskin. "A Comparison of Ultracomputer Architecture and
Lattice Architecure for Problems on Lattices", Ultracomputer Note
No. 20, Courant Institute, NYU, January 1981.

UCN23. David Korn. "Scientific Code Conversion", Ultracomputer Note No. 23,
Courant Institute, NYU, January 1981.

UCN24. David Korn. "Timing Analysis for Codes Run Under the Washcloth
Simulator", Ultracomputer Note No. 24, Courant Institute, NYU, March
1981.

UCN28. Marc Snir. "Netsim - Network Simulator for the Ultracomputer",
Ultracomputer Note No. 28, Courant Institute, NYU, May 1981.

UCN33. B. D. Lubachevsky. "Verification of Several Parallel Coordination
Programs Based on Description of their Reachability Sets",
Ultracomputer Note No. 33, Courant Institute, NYU, July 1981.

UCN40. Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal,

Kevin P. McAuliffe, Larry Rudolph, and Marc Snir, "The NYU

2