Gerald Weiss.

Recursive data types in SETL: automatic determination, data language description, and efficient implementation online

. (page 1 of 9)
Online LibraryGerald WeissRecursive data types in SETL: automatic determination, data language description, and efficient implementation → online text (page 1 of 9)
Font size
QR-code for this ebook


Computer Science Department



TECHNICAL REPORT



RECURSIVE DATA TYPES IN SETL: AUTOMATIC

Termination, data language description,

AND EFFICIENT IMPLEMENTATION



By

Gerald Weiss

Technical Report #201
March 1986





m






TR-201

a types in
c.l






In "O

lO >-!
ICQ

P



(0
(0 'O

u
(9



>







Eh

b



QJ CO



NEW YORK UNIVERSITY







Department of Computer Science
Courant Institute of Mathematical Sciences

251 MERCER STREET, NEW YORK, N.Y. 10012



r



RECURSIVE DATA TYPES IN SETL: AUTOMATIC
DETERMINATION, DATA LANGUAGE DESCRIPTION,
AND EFFICIENT IMPLEMENTATION

By

Gerald Weiss

Technical Report #201
March 1986



Recursive Data Types in SETL:

Automatic Determination,

Data Language Description,

and Efficient Implementation

Gerald Weiss

October 1985



A dissertation in the Department of Computer Science submitted

to the faculty of the Graduate School of Arts and Science in

partial fulfillment of the requirements for the degree of

Doctor of Philosophy at the Courant Institute of

New York University.



Work supported in part by the Office of Naval Research, Grant NO0014856K0413.



© Copyright 1985 Gerald Weiss



Table of Contents



Chapter 1: Introduction 1

1.1. The Type Model of SETL 2

1.2. Recursive Data Types 3

1.2.1. Pointers 4

1.2.2. Recursive Structures in PL/I 5

1.2.3. Recursive Structures in Ada 6

1.2.4. Recursive Structures in LISP 6

1.2.5. Recursive Structures in SETL 8

1.3. Previous Work 11

1.4. Thesis Overview 13

Chapter 2: A Typefinding Algorithm for Recursive Data Structures 14

2.1. An Informal Introduction to Typefinding in SETL 15

2.2 Terminology 1"

2.2 1. The BREACHES Chaining 19

2.3. Tenenbaum's Typefinding Algorithm 20

2.4 Limitations of Tenenbaum's Typefinding Algorithm 23

.a Accurate Typefinding Algorithm in the Presence of Recursive Data Struc-
tures 27

2.6. Termination of the Typefinder 34

2.7 Completeness of the Algorithm 39

2.8. Choice of the Recursive Instance 43

2.9. Typetesting Predicates and Typefinding 44

2.10. The Backwards Pass 45

2.11. Comparison with Other AJgorithms 48

2.11.1. Tenenbaum's Typefinder 48

2.11.2. Type Unification 50

2.12. Applicable Optimizations 51

2.12.1. Storage Applications 53

2.13. Addendum: The Advantages of Not Assigning Recursive Symbols to Ovari-

ables of Embedding Operations 53

Chapter 3: Explicit Typing in SETL: The Representation Sublanguage 56

3.1. Basings 57

3.2. A Critique of the Representation Sublanguage 59

3.2.1. Reprs and Strong Typing 59

3.3. Extending the Representation Sublanguage 64

3.3.1. Recursive Specifications 64



1]



3.3.2. Type Alternation 65

3.3.3. Polymorphic Procedures 67

3.3.3.1. The Generic Facilities of PL/I and Ada 68

3.3.3.2. Polymorphic Procedures in SETL 68

3.3.4. Exclusive Alternation 71

3.3.5. Typing of Constants 72

3.3.6. Sequences and Tuples 73

3.3.7. Field Names for Heterogeneous Structures 73

3.4. Summary 75

Chapter 4: A SETL Implementation of the Typefinder 78

4.1. Introductory Remarks 78

4.2. The Algorithm 79

Chapter 5: Examples 105

5.1. Example 1: Building a linked list using tupleformers 105

5.2. Example 2: A pair of mutually recursive structures 107

5.3. Example 3: The with operator: recursion on the first operand 108

5.4. Example 4: The with operator: recursion on the second operand 109

5.5. Example 5: The with operator: recursion on both operands 110

5.6. Example 6: A structure constructed via component assignment Ill

5.7. Example 7: A binary tree - uniform node structure 112

5.8. Example 8: A binary tree: nonuniform node structure 114

5.9. Example 9: a simple recursive descent parser 116

Chapter 6: Conclusions and Further Research 123



ABSTRACT

Very high level languages are often weakly typed in the sense that different occurrences
of a name can be associated with distinct types. The types of many entities are nevertheless
determinable from the structure of the program, allowing translators for these languages often
to incorporate some sort of typefinding algorithm. Due to problems of algorithmic termina-
tion, however, these algorithms have been unable to type structures of a recursive nature such
as trees. In this thesis we present a method which detects and uncovers the structure of recur-
sive objects, and discuss possible applications of the method to optimization of code. We
examine the run-time type model of SETL and the corresponding data representation sub-
language (DRSL), and present a general critique of the design as well as implementation of
the current data representation sublanguage. The objects expressible by the latter are shown
to be proper subsets of the universe of types assumable by SETL entities at run-time; we
present suggestions for extending the representation sublanguage to allow for complete type
specification.



Acknowledgements

I would like to express my sincere and deep gratitude to my thesis advisor, Professor Ed
Schonberg for his support and guidance. His counseling and advice repeatedly proved to be
invaluable to the development of this thesis. I would also like to thank the members of the
NYU ADA/ED project, in particular Brian Siritzky and Matthew Smosna, for their help and
the many fruitful discussions I have had with them.

I am grateful to my parents for instilling in me the desire to pursue an academic career.
Finally, I would like to thank my wife, Fern, and children, Yocheved and Zvi, for their pati-
ence and encouragement during the development of this thesis. Without their support I could
never have accomplished this.



CHAPTER 1



Introduction



1. Introduction

The semantic level of a programming language is characterized by the degree to which a
programmer must pay attention to implementation details: the higher level the language, the
less need be supplied and the more natural the expression of the algorithm. In order to facili-
tate the natural expression of algorithms, many high level languages free the programmer
from the burden of statically typing program variables through declarative statements. In
addition, data types far removed from those actually implementable via the hardware are pro-
vided, again to mask questions of implementation. Thus, compare the classical concordance
program written in PL/I, where table handling is the programmer's responsibility, and in
SNOBOL where the table is a language-defined data type.

SETL is a set-oriented language developed and implemented at New York University. It
is weakly typed and declaration free, and most of its operators are overloaded. As a conse-
quence, there is a substantial overhead in run-time type checking, and only interpretive code
can be profitably generated by the SETL translator. To remove the burden of run-time type-
checking, and allow the translator to generate efficient machine code, a typefinding algorithm
is needed to infer the types of program entities from their use. The static information that
the typefinder yields can then be used by a data-structuring module to find the most efficient
internal data structures with which to represent composite data objects (sets, tuples and
maps). This thesis examines in detail the typing issues present in SETL and proposes sub-
stantial extensions to existing type-finding algorithms and to the current type structure.



1.1. The Type Model of SETL

The type model of a language is one of the central issues of the design of the language.
Depending upon the type structure, bindings between objects and the types allowable by the
language may be possible at translation time or may have to be postponed until run-time. A
weakly typed language is defined as one in which an object may assume different types dur-
ing the course of its lifetime. A strongly typed language, on the other hand, disallows such
freedom, requiring objects to have a single type. This type may be the union of two or more
simpler types but the strong type model requires a controlled mechanism (e.g. the discrim-
inant in an Ada variant record) for determining which of the possibilities is currently in force
. This notion of strong vs. weak typing is independent of whether or not the language is
declaration-free. The programming language B [Meer85], for example, which is strongly typed
is nevertheless declaration-free. The distinction between strong and weak typing closely
reflects the contrast between mainstream languages (e.g. Pascal, Ada and PL/I) and so-called
very-high-level languages.

SETL allows the components of data aggregates (i.e. sets, tuples and maps) to them-
selves be aggregates. This nesting, or embedding, can be extended to an arbitrary depth. Due
to this facility, there is a rich variety of types that can be assumed by entities in a SETL pro-
gram. The introduction of such structures into a strongly typed language is problematic in
that the shape and structure of such entities is most often unknown until run-time and can
therefore not be typechecked. Furthermore, the set of possible shapes such structures can
assume may be infinite posing serious problems in declaring them. There is no problem if
pointers exist in the language but SETL (and APL) have no such notion. Even APL, which is
weakly typed, is unable to incorporate such objects into the language because it restricts
aggregates to be rectangular arbitrarily dimensioned arrays with scalar component types
rather than allowing arrays of arrays. SETL itself has had difficulties with such structures in
that the automatic typefinder incorporated into the optimizer is unable to assign them any



form of meaningful type, and the representation sublanguage, which provides the programmer
with the facility to declare names in the program contains no facility for declaring such enti-
ties.

1.2. Recursive Data Types

This section focuses upon the class of data types that are typically given a recursive
definition. We examine two basic methods of viewing such structures in various programming
languages.

The common definition of a recursive data structure, RD, is one whose components are
homologous to RD. The term recursive often has another meaning within the context of data
• ructures. Given a linked list, we often say that it is recursive if there is a cycle within the
link structure of the list (e.g. a circular list). This definition is of no interest to us and as such
we denote all lists, circular or not, as recursive.

The user's view of a recursive structure within the context of a particular programming
language is biased by whether the language is pointer- or value-oriented. In a pointer
language (such as PL/I or Ada), composite structures whose structure vary dynamically are
not directly supported, but rather are built up of simple structures linked together by
pointers. Those value languages (languages in which objects are not shared, but must rather
be copied) which allow aggregates to be components of other aggregates, on the other hand,
allow for dynamic objects of arbitrary length and depth, and thus recursive structures are
representable in a more direct fashion. The method of programming in the above two environ-
ments is also affected by this distinction. Value based languages are functional in
approach, objects constructed via calls to functions; while pointer based languages are more
dependent upon the side effects of the assignment statement and parameter modifications.
Before discussing the situation in SETL we present three languages: Ada and PL/I which,
though both are pointer based, have totally different philosophies concerning the use of
pointers in a high-level programming language; and pure LISP which is value based.



1.2.1. Pointers

Pointers have two general areas of application: they can be used to achieve a form of
aliasing or overlaying of two variables (most notably of different data types); and they are
also used to maintain references to dynamically allocated objects. The primary discomfort
expressed by language designers with respect to pointers is in the first area of application.
Overlaying allows the programmer to bypass any type protection provided. Pointers as place-
holders to dynamic objects do have the problem of dangling references, but few would argue
to eliminate dynamic objects on these grounds.

Efforts have been made to allow the use of pointers for dynamic variables while at the
same time prohibiting their use in overlaying variables of different types. When used for
overlaying, pointers are viewed as addresses and thus may be assigned any value within the
range of the address space. No thought is given to the type structure of the program. When it
allows pointers to be manipulated in such a fashion, the language normally provides some
function (e.g. ADDR in PL/I) that accepts a variable as its input and returns the underlying
address of that variables as its result. Note that at this point, there are now two ways of
referencing the variable: by its static (or declared identifier), and secondly, via its internal
address. The absence of such a function in a programming language prevents the programmer
from mixing static and dynamic references.

When pointers are to be used to maintain references to dynamically created objects,
their function becomes one of a higher level than that of machine addresses. When a value is
created at run-time, it is nameless in the sense that it has no identifier associated with it and
thus, the reference to it returned by the allocation function must be saved to allow access to
this new object. Whether the reference to the object is an address or something else entirely is
of no interest, what does concern us is that it is uniquely associated with the object just allo-
cated.



1.2.2. Recursive Structures in PL/I

Of the pointer-oriented languages we shall examine, PL/I [PL/176] has the loosest type
structure with regard to pointer references. Its pointers are said to be untyped, i.e. pointers
are merely viewed as machine addresses without any regard to the logical type structure
imposed by the declarations upon the corresponding referenced objects. Pointers are simply
declared as such and can reference any type of entity in the language. One can additionally
declare a variable as BASED on some pointer, with the effect that that variable is then an
alias for any object the pointer is currently referencing. Variable overlaying is then easily
accomplished as in:

DECLARE X POINTER
Y CHARACTER (4),
Z FIXED BINARY BASED (X);

X = ADDR(Y);
X-> Z = 15;
PUT LIST (Y);

The above is a violation of whatever type restrictions PL/I does possess in that it allows the
assignment of an integer into an object declared as a character string. One saving grace in
favor of PL/I is that it does prohibit arbitrary arithmetic to be directly performed upon
pointers.

It is clear from the above that there is no means by which the type of a referenced vari-
able can be determined in PL/I without additional software support on the part of the
applications programm

One advantage of this scheme is to allow the construction of heterogeneous linked struc-
tures but at the cost of a total inability to perform any type-checking on pointer-manipulated
variables. A related problem, alluded to above, is the inability of the programmer to deter-
mine what form of object the current pointer being manipulated is referencing. As a result,
what one often encounters in programs dealing with such structures is a LISP-like system,
with separate list and atomic nodes, the list nodes being identical in format and containing



type descriptors for the atomic nodes they point to.

1.2.3. Recursive Structures in Ada

Ada [Ada83{, following PASCAL [\Virt7l], is strongly typed with respect to its pointers
(called access types in Ada). When declaring an access object, the type of object being
accessed must also be specified. Pointers are in this manner partitioned as to their use: vari-
ables used to reference one type of object may not be used to reference another type. In
order to achieve the effect of a nonhomogeneous structure, variant records must be employed.
Thus, Pascal and Ada have transformed the pointer into the higher level object discussed
above, one whose purpose is to reference dynamically allocated objects. It is for this reason
that the designers have chosen to call such an entity an access type, rather than using the
term pointer which has a machine address connotation associated with it. As an example, we
present the declaration of a recursive type in Ada [Ada83 pg 3-41]. A record type, CELL, is
declared, to represent a typical node of a doubly linked list. An access type, LINK, is also
declared, which can point to objects of type CELL:

type CELL;

type LINK is access CELL;

type CELL is
record

VALUE : INTEGER;
SUCC :LINK;
PRED :LINK;
end record;

The first declaration of CELL is known as an incomplete type declaration and allows the
declaration of LINK to specify the type of object it may point to.

1.2.4. Recursive Structures in LISP

In both of the above languages, the programmer, in order to construct and manipulate a
recursive structure, must deal directly with the underlying representation, i.e. it becomes his
responsibility to manipulate the pointers linking the structure together. It is far more



desirable for the programmer to ignore such details and instead concentrate upon the abstract
properties of the structure used. Pure LISP [McCa65] allows the programmer such facility. A
list is viewed as a sequence of items, some of them lists themselves. High-level operations are
provided to allow the programmer to manipulate such lists at their conceptual level, e.g. to
insert an element into the list, concatenate two lists together, etc. The fact that internally
such structures are maintained in linked format for efficiency reasons is of no concern to the
programmer.

The primitive data structure in LISP is the linked list, itself a recursive structure. Trees
and more general structures can be easily realized by allowing the elements of a list to be
themselves lists. Such constructs are viewed in LISP as being single entities, rather than the
linked structures of their counterparts in pointer languages. Indeed, the list structure of LISP
is maintained as a binary tree with head (CAR) and tail (CDR) selectors. Such structures vary
their shape and size dynamically allowing for a convenient and flexible manner in which to
program recursive data structures

In addition, pure LISP is value-oriented, i.e. objects are never shared. This implies that
whenever modification to one structure might affect the value of another, a copy needs to be
made to ensure the system's integrity.

In such a system, the problems associated with programming recursive structures in
pointer-oriented languages disappear [Hoar75] Allocation and deallocation of storage is no
longer the concern of the programmer, but rather becomes the responsibility of the language
processor. Side effects to other structures are no longer possible. Dangling references (i.e.
pointers to deallocated storage areas) cannot occur.

If pure LISP is extended with the functions SET, REPLACA and REPLACD, which
allow explicit pointer modification of the CAR and CDR values of a list, LISP loses its func-
tional value. Using these functions, it is possible to make a change in one structure and in
doing so change the values of other structures. Using these selective updating operations, cir-



8

cular structures can be created, and true object sharing can be achieved introducing side-
effects and argument modification. It is for this reason that REPLACA and REPLACD are
normally used for special-purpose programming only (e.g. garbage collection procedures).

1.2.5. Recursive Structures in SETL

SETL allows for both implementations of recursive structures: pointer and value
oriented. While essentially a value-semantics language, the flavor of pointers can be gotten
through use of the atom data type as in the abstract, pointers establish an explicit mapping
between a tag and some corresponding value. In this role whether the pointer is implemented
as a physical address or not is irrelevant. The atom data type of SETL is akin to the LISP
gensym, in that each invocation of its generator (newat in SETL) yields a name distinct
from all others in the program. These can then be used as domain elements of maps to pro-
vide the same effect as that of pointers. As an example, a binary tree can be represented in
SETL, using the data representation sublanguage by the following three maps:

LEFT : map (ATOM) ATOM;
INFO : map (ATOM) INFO_TYPE;
RIGHT: map (ATOM) ATOM;

where map (d) r denotes a map with domain d and range r. LEFT and RIGHT in this exam-
ple play the roles normally assumed by pointers.

In addition, SETL also allows for the high-level form of recursive structure representa-
tion via arbitrarily nested tuples. Using this approach, each node of a binary tree, for exam-
ple, can be represented by a tuple, of length 3, whose first value is the left subtree, itself
represented in the above fashion; the second element being the information contained in this
node, and the last element representing the right subtree:

tree: tuple(tree, SOME_TYPE, tree)

SETL, however, does not provide any facility for selective updating, i.e. partial
modification of a structure, in the manner that LISP allows with REPLACA and REPLACD.



Thus, the statement:

x(l):=x;
has the same effect as:

x := [x, x(2), x(3), ..., x(#x)]

and thus no circular structures or side-effects can occur when programming in this manner.

The above two representations are characteristic of two diametrically opposed ways of
programming in SETL. The second, using nested tuples, reminiscent of LISP, takes a func-
tional approach, in which new copies of the structure are created for any modifications that
are made to the structure. A typical algorithm traversing such a structure will extract com-
ponents of the structure when traversing it, and upon unwinding the recursion reforms these
components into new structures. As an example, consider the following fragment that inserts
an element x into a binary search tree:

procedure insert (tree, x);

$ Handling of leaf cases

[val, left, right] := tree;
if x < val then

left := insert(left, x);
else

right := insert (right, x);
end if;

return val, left, rightj;
end procedure:

in the situation where the node being examined is not a leaf, requiring a traversal further
down the tree (i.e. inserting the new element into either the left or right subtree), the three
components of the current node are extracted. The appropriate child (left or right) is
traversed. Upon returning from the recursive call to insert, a new tuple is formed, consisting
of the data element of this node, and the two children, one of them modified.

The first method, employing the maps LEFT and RIGHT as successor functions, is
characteristic of programming in a more conventional language, one in which pointers are



10

readily available, the major distinction being the use of logical atoms rather than physical
pointers. Modifications are made to the structure in this model, not by creating a new
updated copy of it, but rather by making changes to the domains or ranges of the maps. As
an example, a typical routine to insert a value into a tree represented by the three maps
LEFT, INFO and RIGHT looks like:

proc insertftree, x);

if tree = om then $ Empty tree

z := newat;
INFO(z) :=x;
tree := z;
return;
elseif x < INFO(tree) then $ Left insertion
if LEFT(tree) = om then $ Leaf node

z := newat;

INFO(z) := x;

LEFT(tree) := z;

return;
else $ must go down further

insert(LEFT(tree), x);
end if LEFT(tree);
else $ right insertion

if RIGHT(tree) = om then $ Leaf node

z := newat;

INFO(z) := x;

RIGHT(tree) := z;

return;
else $ must go down further

insert (RIGHT(tree), x);
end if RIGHT(tree);
end if x < INFO(tree);
end proc insert;

An interesting observation is that SETL is closer to FORTRAN or Algol 60 in this area
than to Pascal or PL/I, in that the data structure is 'flattened' into one map per field of all
the nodes, rather than a set of records, each containing only the fields pertaining to one node.
This is due to SETL's (like FORTRAN and Algol) lack of a record type.

Upon examination of these two modes of programming recursive data structures, it
becomes obvious that the pointer-oriented method is the more efficient. This is because less
run-time allocation need be performed as successive copies need not be created. If nested



11


1 3 4 5 6 7 8 9

Online LibraryGerald WeissRecursive data types in SETL: automatic determination, data language description, and efficient implementation → online text (page 1 of 9)