
Automatic Exploitation Of Concurrency in C: Is It Really So Hard?

Lori S. Grob

Ultracomputer Note #140
June, 1988


Ultracomputer Research Laboratory




New York University

Courant Institute of Mathematical Sciences

Division of Computer Science

251 Mercer Street, New York, NY 10012



This work was supported in part by the US Department of Energy under grant number DE-FG02-88ER25052 and in part by IBM contract number N00039-84-R-0605.

Preliminary versions of this paper were presented at the Association Française des Utilisateurs d'Unix, Paris, France, March 1988 and at the European Unix Systems User Group Workshop, Dublin, Ireland, Autumn 1987.


The vast majority of work being done today on the automatic exploitation of
concurrency, for both multiprocessors and vector machines, is not being
done for C. Yet there are many important applications, written in C, which
would benefit immensely from the speedup produced by such techniques.
Many more applications would be written in C were there optimizing, vec-
torizing and parallelizing compilers for the language.

In this paper we consider whether it is possible to have compilers that will
automatically exploit concurrency in C. We discuss the relationship between
automatic exploitation of concurrency for the purposes of vectorizing and
multiprocessing. We review the basic techniques and transformations used
and examine the necessary conditions to perform these transformations, with
examples in C. Several elements of the C language and programming style,
such as pointers and recursion, make it difficult to do the necessary data
flow analysis. There are two possible approaches to this problem: to bypass statements or blocks of code that contain "difficult" features, leaving those fragments unoptimized, or to suitably restrict the language.
We examine the choices made by the few available vectorizing and parallel-
izing C compilers and consider what the future may hold in the light of
current research.

1. Introduction

The goal of automatic parallelization, whether for multiprocessors or for vector
machines, is to take a program with serial semantics and have the compiler or preprocessor
produce a parallel program.

The programmer avoids having to deal with the difficulties of synchronization and
other issues that arise in the writing and debugging of explicitly parallel programs (Grob and
Lipkis [86]). There is a guarantee that the semantics of the program will be preserved and
that the resulting program will be correct. The arguments that are used for stating the need
for and advantages of optimizing compilers are applicable here as well. The compiler will
be able to find parallelism that the programmer does not see. Programmers who set out to
solve a problem shouldn't have to be experts in the kinds of data flow techniques used to
find non-obvious potential parallelism, or in the transformations necessary to exploit paral-
lelism. A parallelizing compiler will be able to work on library routines and other functions
that were not written by the programmer. However, these arguments do not eliminate the
usefulness or the desirability of explicit parallel programming. It is often useful for a pro-
grammer to be able to manipulate the parallelism and control the asynchrony him or herself.

Because the potential benefits of automatic parallelization are considerable there is a
great deal of interest in the topic. However, most work on the compilers that discover and
exploit concurrency is being done on FORTRAN and not on C. There are features of the C
programming language that make it difficult to do the analysis necessary to optimize, much

less exploit concurrency. Chief among these features are unconstrained pointers and other forms of aliasing. The first few attempts to write parallelizing or vectorizing compilers for C either restricted the language by limiting the use of pointers or did not attempt to parallelize statements or blocks of code that contained pointers.

In this paper, we will attempt to explain what it is necessary to know about a program
in order to parallelize it safely and why the C language poses problems. We will also give a
few approaches to the problem that appear potentially promising. In order to present these
issues clearly, we will first present an overview of the necessary background material.

2. Architectural Models

We are concerned with automatic exploitation of parallelism for three basic architec-
tural models: vector processors, multiprocessors and very long instruction word (VLIW)
machines. The techniques and constraints involved in finding parallelism for these architec-
tures are similar although the architectures themselves are quite different.

2.1. Vector Processors

Conceptually, the idea behind a vector processor is quite simple. An operation is per-
formed with two arrays as the operands. Instead of a loop iterating through all the elements
of the arrays, all the elements of the arrays are processed in parallel. For many kinds of
programs, which spend the bulk of their execution time on vector operations, this can be a
significant speedup.

Architecturally, what happens in most modern vector processors is somewhat different.
The elements of the arrays that are the operands of the vector operation are passed into a
pipeline of processors. Each processor performs a part of the primitive operation on what-
ever data it is given. The data then moves on to the next processor which performs its part
of the operation. At the same time the previous processors are performing on other data. It
takes some amount of time for the first operands to move through the pipeline. There is a
result on every tick thereafter.

2.2. Multiprocessors

The multiprocessors referred to in this paper are MIMD* machines. MIMD machines
consist of many processors working together on a single job. Each processor may operate
autonomously, not in lock step with the other processors. The processors may or may not
share memory. They may communicate over a bus or a network.

*In the taxonomy of Flynn [66], MIMD is the category of Multiple Instruction stream, Multiple Data stream
computers (i.e., asynchronous multiprocessors); whereas in SIMD designs there is just a Single Instruction stream
for the Multiple Data streams.

Ultracomputer Note 140 Page 2

2.3. VLIW Machines

VLIW machines have a word size large enough to hold multiple machine instructions.
Instructions loaded into the word at the same time are executed in parallel. The difficulty is
in deciding which instructions may be executed in parallel and in restructuring the program
to execute as many instructions as possible in parallel without changing the results of the
computation.

2.4. Hybrids

Although these are the three architectures that we are going to discuss, the categories
are not absolute. There are vector machines that contain more than one set of vector proces-
sors and do vector processing in parallel and there are multiprocessors and VLIW machines
that have vector units.

3. Automatic Parallelization

The potential for parallelism arises from the fact that some statements or operations do
not have to be executed in the order in which they are written. The ordering is often arbitrary,
or at least imposed only by the necessity of putting the statements in some linear sequence.
The relative order of every statement is often not necessitated by the problem. If a group of
operations or statements are independent of each other in terms of flow of control and in
terms of data, then they can be executed at the same time.

3.1. Levels of Parallelism

Parallelism can be found at many different levels in a program. Expression level paral-
lelism refers to the evaluation of several different parts of the expression at the same time.
For instance, in an expression containing addition operations and multiplication operations
the various operands could be evaluated at the same time, subject to the rules of precedence.
This is the type of parallelism found in VLIW machines. Further, in VLIW machines, there
is expression level parallelism for many expressions happening at the same time. There are
also statements that can be run in parallel. There is loop level parallelism, where the
iterates are run in parallel. Ideally, it is desirable to find all the parallelism in a program,
no matter at what level it exists. It may be questioned whether it is worth executing some-
thing in parallel if it is only a single expression. The answer to this depends on the architec-
ture and the implementation of those features that create and control parallelism.

Numerical programs and other scientific applications are often very regularly struc-
tured. The computation is often done in a series of loops. The data structures are arrays or
matrices. This type of program lends itself very well to automatic parallelization and vector-
ization as generally practiced.

Non-numeric applications may have a less regular structure. But if a program is written
with the idea that it is going to be automatically parallelized, that is, bearing in mind the


types of transformations done and the constructs that can be parallelized, these applications
can get significant speedup (Lee, Kruskal and Kuck [85]). When non-numeric applications
with irregular shapes are parallelized the speedup obtained is likely to be less significant.
Because VLIW machines exploit fine grain or expression level parallelism they are less sus-
ceptible to this problem.

Most vectorizing compilers and parallelizing compilers for multiprocessing look for
parallelism at the loop level. The vectorizing compilers attempt to parallelize the innermost
loop in a nested construct. Each statement in the innermost loop will be vectorized
separately. The compilers for multiprocessing attempt to parallelize the outermost loop in
order to minimize synchronization and to achieve optimal parallelism. The goal here is to
execute loops in parallel with all the iterates being done at once. It is loop level parallelism
that we are going to discuss.

3.2. When Can A Loop Be Executed In Parallel?

In order to understand the complexities of parallelism of this sort, one must picture all
the iterations of a loop executing in parallel and at different speeds. The danger is that
there may be a dependence existing between several statements in the same or different
iterations. The existence of a dependence may make it necessary to have stores and fetches
of various elements completed in a prescribed order. Vectorizing or multiprocessing may
allow that ordering to be violated, causing the program result to be incorrect. So it is neces-
sary to discover the dependences that exist and decide if they are of a type that will prevent
the vectorization or parallelization of the loop. The dataflow analysis that is necessary to
prove that no dependence exists is a superset of the kinds of dataflow analysis that are done
for traditional optimization.

3.2.1. Dependences

Dependence analysis is important in many areas of optimization and critical in
automatic parallelization, including deciding which statements may be loaded into a single
VLIW instruction and executed together. A data dependence exists between one statement
and another or between a statement and itself when the same memory location or elements
of the same array are accessed in those statements. This may enforce an ordering or seriali-
zation of the statements in which the dependence exists.

Not all dependences prohibit parallelization or optimization. In some cases, the code
must be transformed and the dependence removed before any parallelization or optimization
can occur. Sometimes all the dependent code must be moved or treated as a unit. In other
cases, the dependence will preclude any parallelization or other potential alteration in the
sequence of execution.

The data dependences which occur between statements have been put into three classifi-
cations: flow dependence, anti-dependence, and output dependence (for more background on


dependence see Wolfe [82] and Burke and Cytron [86]).

A flow dependence means data will "flow" forward from one statement to another.

An example of a flow dependence is:

for (i = 0; i <= 100; i++) {
    x = w + z;
    a[i] = x;
}

The second statement is dependent upon data from the first. The store must be completed
before the load.

An anti-dependence is the reverse of a flow dependence. The load precedes the store.
In order to get a correct value stored the order of the statements must be preserved.

for (i = 0; i <= 100; i++) {
    x = a[i+1];
    a[i] = w + z;
}

