Kawa: Compiling Scheme to Java

Introduction

While Java is a decent programming language, its success is largely due to using portable bytecodes, and its integration into secure web browsers. Many prefer other programming languages. If they are implemented on top of Java, programmers can gain many of the extra-linguistic benefits of Java, including libraries, portable bytecodes, web applets, and the existing efforts to improve Java implementations and tools.

Scheme is a simple yet powerful language. It is a non-pure functional language (i.e. it has first-class functions, lexical scoping, non-lazy evaluation, and side effects). It has dynamic typing, and usually has an interactive read-evaluate-print interface. The dynamic nature of Scheme (run-time typing, immediate expression evaluation) may be a better match for the dynamic Java environment (interpreted bytecodes, dynamic loading) than Java is!

Note: Kawa is also the trademarked name of a commercial Java development environment. To avoid confusion, we may soon change the name of the system described here, but in any case, information and source code will be available from `http://www.cygnus.com/~bothner/kawa.html'.

Background

Starting in 1995 Cygnus (on behalf of the Free Software Foundation) developed Guile, an implementation of Scheme suitable as a general embedding and extension language. Guile was based on Aubrey Jaffar's SCM interpreter; the various Guile enhancements were initially done by Tom Lord. In 1995 we got a major contract to enhance Guile, and with our client we added more features, including threads (primarily done by Anthony Green), and internationalization.

The contract called for a byte-code compiler for Guile, and it looked like doing a good job on this would be a major project. One option we considered was compiling Scheme into Java bytecodes and executing them by a Java engine. The disadvantage would be that such a Scheme system would not co-exist with Guile (on the other hand, we had run into various technical and non-technical problems with Guile that led us to conclude that Guile would after all not be strategic to Cygnus). The advantage of a Java solution was leveraging off the tools and development being done in the Java "space", plus that Java was more likely to be strategic long-term.

The customer agreed to using Java, and I started active development June 1996. Kawa 1.0 was released to our customer and "the Net" September 1996. Development has continued since then, at a less intense pace! The long-term goal is an object-oriented environment that harmoniously integrates Scheme, Java, and other languages.

As a base, I used the Kawa Scheme interpreter written by R. Alexander Milowski. He needed an object-oriented Scheme interpreter to implement DSSSL International Standard ISO/IEC 10179:1996(E), Document Style Semantics and Specification Language. 1996, a Scheme-like environment for expressing style, formatting, and other processing of SGML documents International Standards Organization: Standard Generalized Markup Language, ISO 8879. DSSSL is an subset of "pure" Scheme with some extensions. Kawa 0.2 was a simple interpreter which was far from complete. It provided a useful starting point, but almost all of the original code has by now been re-written.

There are three basic ways of implementing Scheme in Java.

One could write an interpreter in Java, but this has poor performance.
One could translate into Java source code. However, invoking an external Java source-to-bytecode compiler is too slow except for batch compilation. Also, some features of Java bytecodes (such as goto) are not available in the Java language.
One could translate into Java bytecodes, either ahead of time (a batch compiler), or for immediate execution.

We have implemented the last option for Kawa, partly because that was what our contract called for, but also because it is the best solution. To generate bytecodes, we use a new "codegen" package, which is an intermediate-level toolkit to generate Java bytecodes and ".class" files.

We will discuss the compiler later, but first we will give an overview of the run-time environment of Kawa, and the classes used to implement Scheme values.

Objects and Values

Java J. Gosling, B. Joy, G.~Steele: The Java Language Specification, Addison-Wesley, 1996 has primitive types (such as 32-bit int) as well reference types. If a variable has a reference type, it means that it can contains references (essentially pointers) to objects of a class, or it can contain references to objects of classes that "extend" (inherit from) the named class. The inheritance graph is "rooted" (like Smalltalk and unlike C++); this means that all classes inherit from a distinguished class java.lang.Object (or just Object for short).

Standard Scheme W. Clinger and J. Rees (editors): Revised 4 report on the algorithmic language Scheme. LISP Pointers, 4(3):1--55, 1991. has a fixed set of types, with no way of creating new types. It has run-time typing, which means that types are not declared, and a variable can contain values of different types at different times. The most natural type of a Java variable that can contain any Scheme value is therefore Object, and all Scheme values must be implemented using some class that inherits from Object.

The task then is to map each Scheme type into a Java class. Whether to use a standard Java class, or to write our own is a tradeoff. Using standard Java classes simplifies the passing of values between Scheme functions and existing Java methods. On the other hand, even when Java has suitable built-in classes, they usually lack functionality needed for Scheme, or are not organized in any kind of class hierarchy as in Smalltalk or Dylan. Since Java lacks standard classes corresponding to pairs, symbols, or procedures, we have to write some new classes, so we might as well write new classes whenever the existing classes lack functionality. One extreme would be to define a new SchemeObject class, and derive from it classes for all Scheme values. While this might make implementing Scheme easier, Kawa does not go that far, partly because we want to allow Scheme variables to contain arbitrary Java objects.

The Scheme boolean type is one where we use a standard Java type, in this case Boolean (strictly speaking java.lang.Boolean). The Scheme constants `#f' and `#t' are mapped into static fields (i.e. constants) Scheme.falseObject and Scheme.trueObject. These are initialized to new Boolean(false) and new Boolean(true), respectively, when Kawa starts up.

On the other hand, numbers and collections are reasonably organized into class hierarchies, which Java does not do well. So Kawa has its own classes for those, discussed in the following sections.

Symbols

Symbols represent names, and do not need much functionality. Scheme needs to be able to convert them to and from strings, and they need to be "interned" (which means that there is a global table to ensure that there is a unique symbol for a given string). Symbols are immutable and have no accessible internal structure. Currently, Scheme symbols are implemented using a Kawa-specific Symbol class.

I am planning to re-implement symbols to use the the standard Java String class, which supports an intern operator. Note that the Java String class implements immutable strings, and is therefore cannot be used to implement Scheme strings. However, it makes sense to use it to implement symbols, since the way Scheme symbols are used is very similar to how Java Strings are used. While Kawa currently assumes a Java implementation corresponding to version 1.0 of JDK (Sun's Java Development Kit), a future version will depend on JDK 1.1 features. One such new 1.1 feature is that literal Java Strings will be required to be automatically interned. This will make it even more appealing to use String to implement Scheme symbols.

Numbers

Scheme defines a "numerical tower" of numerical types: number, complex, real, rational, and integer. Kawa implements the full "tower" of Scheme number types, which are all sub-classes of the abstract class Quantity discussed in section Quantities.

class Complex extends Quantity {
  public abstract RealNum re ();
  public abstract RealNum im ();
  ...
}

Complex is the class of abstract complex numbers. It has three subclasses: the abstract class RealNum of real numbers; the general class CComplex where the components are arbitrary RealNum fields; and the optimized DComplex where the components are represented by double fields.

class RealNum extends Complex {
  public final RealNum re ()
  { return this; }
  public final RealNum im ()
  { return IntNum.zero(); }
  public abstract boolean isNegative ();
  ...
}

class DFloNum extends RealNum {
  double value;
  ...
}

Concrete class for double-precision (64-bit) floating-point real numbers.

class RatNum extends RealNum {
  public abstract IntNum numerator();
  public abstract IntNum denominator();
  ...
}

RatNum, the abstract class for exact rational numbers, has two sub-classes: IntFraction and IntNum.

class IntFraction extends RatNum {
  IntNum num;
  IntNum den;
  ...
}

The IntFraction class implements fractions in the obvious way. Exact real infinities are identified with the fractions 1/0 and -1/0.

class IntNum extends RatNum {
  int ival;
  int[] words;
  ...
}

The IntNum concrete class implements infinite-precision integers. The value is stored in the first ival elements of words, in 2's complement form (with the low-order bits in word[0]).

There are already many bignum packages, including a couple written in Java. What are the advantages of this one?

A complete set of operations, including gcd and lcm; logical, bit, and shift operations; power by repeated squaring; all of the division modes from Common Lisp (floor, ceiling, truncate, and round); and exact conversion to double.
Consistency and integration with a complete "numerical tower." Specifically, consistency and integration with "fixnum" (see below).
Most bignum packages use a signed-magnitude representation, while Kawa uses 2's complement. This makes for easier integration with fixnums, and also makes it cheap to implement logical and bit-fiddling operations.
Use of all 32 bits of each "big-digit" word, which is the "expected" space-efficient representation. More importantly, it is compatible with the mpn routines from the Gnu Multi-Precision library (gmp by T. Granlund). The mpn routines are low-level algorithms that work on unsigned pre-allocated bignums; they have been transcribed into Java in the MPN class. If better efficiency is desired, it is straight-forward to replace the MPN methods with native ones that call the highly-optimized mpn functions.

If the integer value fits within a signed 32-bit int, then it is stored in ival and words is null. This avoids the need for extra memory allocation for the words array, and also allows us to special-case the common case.

As a further optimization, the integers in the range -100 to 1024 are pre-allocated.

Mixed-type arithmetic

Many operations are overloaded to have different definitions depending on the argument types. The classic examples are the functions of arithmetic such as `+', which needs to use different algorithms depending on the argument types. If there is a fixed and reasonably small set of number types (as is the case with standard Scheme), then we can just enumerate each possibility. However, the Kawa system is meant to be more extensible and support adding new number types.

The solution is straight-forward in the case of a one-operand function such as "negate", since we can use method overriding and virtual method calls to dynamically select the correct method. However, it is more difficult in the case of a binary method like `+', since classic object-oriented languages (including Java) only support dynamic method selection using the type of the first argument ("this"). Common Lisp and some Scheme dialects support dynamic method selection using all the arguments, and in fact the problem of binary arithmetic operations is probably the most obvious example where "multi-dispatch" is useful.

Since Java does not have multi-dispatch, we have to solve the problem in other ways. Smalltalk has the same problems, and solved it using "coercive generality": Each number class has a generality number, and operands of lower generality are converted to the class with the higher generality. This is inefficient because of all the conversions and temporary objects (see T. A. Budd: Gneralized arithmetic in C++. Journal of Object-Oriented Programming, 3(6):11--22, Feb. 1991, and it is limited to what extent you can add new kinds of number types.

In "double dispatch" D. Ingalls: A simple technique for handling multiple polymorphism. ACM SIGPLAN Notices, 21(11):347--349, Nov. 1986., the expression x-y is implemented as x.sub(y). Assuming the (run-time) class of x is Tx and that of y is Ty, this causes the sub method defined in Tx to be invoked, which just does y.subTx(x). That invokes the subTx method defined in Ty which can without further testing do the subtraction for types Tx and Ty.

The problem with this approach is that it is difficult to add a new Tz class, since you have to also add subTz methods in all the existing number classes, not to mention addTz and all the other operations.

In Kawa, x-y is also implemented by x.sub(y). The sub method of Tx checks if Ty is one of the types it knows how to handle. If so, it does the subtraction and returns the result itself. Otherwise, Tx.sub does y.sub_reversed(x). This invokes Ty.sub_reversed (or sub_reversed as defined in a super-class of Ty). Now Ty (or one of its super-classes) gets a chance to see if it knows how to subtract itself from a Tx object.

The advantage of this scheme is flexibility. The knowledge of how to handle a binary operation for types Tx and Ty can be in either of Tx or Ty or either of their super-classes. This makes is easier to add new classes without having to modify existing ones.

Quantities

The DSSSL language is a dialect of Scheme used to process SGML documents. DSSSL has "quantities" in addition to real and integer numbers. Since DSSSL is used to format documents, it provides length values that are a multiple of a meter (e.g. 0.2m), as well as derived units like cm and pt (point). A DSSSL quantity is a product of a dimension-less number with an integral power of a length unit (the meter). A (pure) number is a quantity where the length power is zero.

For Kawa, I wanted to merge the Scheme number types with the DSSSL number types, and also generalize the DSSSL quantities to support other dimensions (such as mass and time) and units (such as kg and seconds). Quantities are implemented by the abstract class Quantity. A quantity is a product of a Unit and a pure number. The number can be an arbitrary complex number.

class Quantity extends Number {
  public Unit unit ()
  { return Unit.Empty; }
  public abstract Complex number ();
  ...
}

class CQuantity extends Quantity {
  Complex num;
  Unit unt;
  public Complex number ()
  { return num; }
  public Unit unit ()
  { return unt; }
  ...
}

A CQuantity is a concrete class that implements general Quantities. But usually we don't need that much generality, and instead use DQuanity.

class DQuantity extends Quantity {
  double factor;
  Unit unt;
  public final Unit unit ()
  { return unt; }
  public final Complex number ()
  { return new DFloNum(factor); }
  ...
}

class Unit extends Quantity {
  String name; // Optional.
  Dimensions dims;
  double factor;
  ...
}

A Unit is a product of a floating-point factor and one or more primitive units, combined into a Dimensions object. The Unit name have a name (such as "kg"), which is used for printing, and when parsing literals.

class BaseUnit extends Unit {
  int index;
  ...
}

A BaseUnit is a primitive unit that is not defined in terms of any other Unit, for example the meter. Each BaseUnit has a different index, which is used for identification and comparison purposes. Two BaseUnits have the same index if and only if they are the same BaseUnit.

class Dimensions {
  BaseUnit[] bases;
  short[] powers;
  ...
}

A Dimensions object is a product and/or ratio of BaseUnits. You can think of it as a data structure that maps every BaseUnit to an integer power. The bases array is a list of the BaseUnits that have a non-zero power, in order of the index of the BaseUnit. The powers array gives the power (exponent) of the BaseUnit that has the same index in the bases array.

Two Dimensions objects are equal if they have the same list of bases and powers. Dimensions objects are "interned" (using a global hash table) so that they are equal only if they are the same object. This makes it easy to implement addition and subtraction:

  public static DQuantity add (DQuantity x, DQuantity y)
  {
    if (x.unit().dims != y.unit().dims)
      throw new ArithmeticException ("units mis-match");
    double r = y.unit().factor / x.unit().factor;
    double s = x.factor + r * y.factor;
    return new DQuantity (s, x.unit());
  }

The Unit of the result of an addition or subtraction is the Unit of the first operand. This makes it easy to convert units:

kawa> (+ 0cm 2.5m)
250cm

Because Kawa represents quantities relative to user-specified units, instead of representing them relative to primitive base units, it can automatically print quantities using the user's preferred units. However, this does make multiplication and division more difficult. The actual calculation (finding the right Dimensions and multiplying the constant factors) is straight-forward. The problem is generating the new compound unit, and later printing out the result in a human-friendly format. There is no obvious right way to do this. Kawa creates a MulUnit to represent a compound unit, but it is not obvious which simplifications should be done when. Kawa uses a few heuristics to simplify compound units, but this is an area that could be improved.

Collections

Kawa has a rudimentary hierarchy of collection classes.

class Sequence {
  abstract public int length();
  abstract public Object elementAt (int index);
  ...
}

A Sequence is the abstract class that includes lists, vectors, and strings.

class FString extends Sequence {
  char[] value;
  ...
}

Used to implement fixed-length mutable strings (array of Unicode character). This is used to represent Scheme strings.

class FVector extends Sequence {
  Object[] value;
  ...
}

Used to implement fixed-length mutable general one-dimensional array of Object. This is used to represent Scheme vectors.

class List extends Sequence {
  protected List () { }
  static public List Empty = new List ();
  ...
}

Used to represent Scheme (linked) lists. The empty list '() is the static (global) value List.Empty. Non-empty-lists are implemented using Pair objects.

class Pair extends Sequence {
  public Object car;
  public Object cdr;
  ...
}

Used for Scheme pairs.

class PairWithPosition extends Pair {
  ...
}

Like Pair, but includes the filename and linenumber in the file from which the pair was read.

Future plans include more interesting collection classes, such a sequences implemented as a seekable disk file; lazily evaluated sequences; hash tables; APL-style multi-dimensional arrays; stretchy buffers.(1)

Procedures

Scheme has procedures as first-class values. Java does not. However, we can simulate procedure values, by overriding of virtual methods.

class Procedure {
  public abstract Object applyN (Object[] args);
  public abstract Object apply0 ();
  public abstract Object apply1 (Object arg1);
  ...;
  public abstract Object apply4 (Object arg1, ..., Object arg4);
  ...
}

We represent Scheme procedures using sub-classes of the abstract class Procedure. To call (apply) a procedure with no arguments, you invoke its apply0 method; to invoke a procedure, passing it a single argument, you use its apply1 method; and so on using apply4 if you have 4 arguments. Alternatively, you can bundle up all the arguments into an array, and use the applyN method. If you have more than 4 arguments, you have to use applyN.

Notice that all Procedure sub-classes have to implement all 6 methods, at least to the extent of throwing an exception if it is passed the wrong number of arguments. However, there are utility classes Procedure0 to Procedure4 and ProcedureN.

class Procedure1 extends Procedure {
  public abstract Object applyN (Object[] args)
  {
    if (args.length != 1)
      throw new WrongArguments();
    return apply1(args[0]);
  }
  public Object apply0 ()
   { throw new WrongArguments(); }
  public abstract Object apply1 (Object arg1);
  public Object apply2 (Object arg1, Object arg2)
  { throw new WrongArguments();}
  ...
}

Primitive procedures are generally written in Java as sub-classes of these helper classes. For example:

class cdr extends Procedure1 {
  public Object apply1 (Object arg1)
   { return ((Pair) arg1).cdr; }
}

A user-defined Scheme procedure is compiled to a class that is descended from Procedure. For example, a variable-argument procedure is implemented as a sub-class of ProcedureN, with an applyN method comprising the bytecode compiled from the Scheme procedure body. Thus primitive and user-defined procedure have the same calling convention.

If a nested procedure references a lexical variable in an outer procedure, the inner procedure is implemented by a "closure". Kawa implements a closure as a Procedure object with a "static link" field that points to the inherited environment. In that case the lexical variable must be heap allocated, but otherwise lexical variables use local Java variable slots. (This is conceptually similar to the "Inner classes" proposed for JDK 1.1.)

class ModuleBody extends Procedure0 {
  public Object apply0 ()
  {return run(Environment.current()); }
  public abstract Object run (Environment env);
  ...
}

Top-level forms (including top-level definitions) are treated as if they were nested inside a dummy procedure. A ModuleBody is such a dummy procedure. When a file is loaded, the result is a ModuleBody; invoking run causes the top-level actions to be executed.

Overview of compilation

These are the stages of compilation:

Reading: The first compilation stage reads the input from a file, from a string, or from the interactive command interpreter. The result is one or more Scheme forms (S-expressions), usually lists. If reading commands interactively, only a single form is read; if reading from a file or string, all the forms are read until end-of-file or end-of-string; in either case, the result is treated as the body of a dummy function (i.e. a ModuleBody).
Translation: The source form body is rewritten into an Expression object, specifically a ModuleExp. This stage handles macro expansion and lexical name binding. Many (future) optimizations can be done in this phase by annotating and re-arranging Expression trees.
Code generation: The resulting ModuleExp is compiled into one or more byte-coded classes. This is done by invoking the virtual compile method recursively on the Expressions, which generates instructions (using the codegen package) to evaluate the expression and leave the result on the Java operand stack. At the end we ask the codegen package to write out the resulting classes and methods. They can be written to a file (for future use), or into byte arrays in memory.
Loading: The compiled bytecodes are loaded into the Kawa interpreter. In the case of code that is compiled and then immediately executed, the compiled code can be immediately turned into Java classes using the Java ClassLoader feature. (That is how the read-eval-print loop works.) An instance of the compiled sub-class of ModuleBody is created and run, which normally produces various side-effects.

Top-level environments

class Environment {
  ...
}

An Environment is a mapping from symbols to bindings. It contains the bindings of the user top-level. There can be multiple top-level Environments, and an Environment can be defined as an extension of an existing Environment. The latter feature is used to implement the various standard environment arguments that can be passed to eval, as adopted for the next Scheme standard revision ("R5RS"). Nested environments were also implemented to support threads, and fluid bindings (even in the presence of threads).

Expressions

class Expression {
  public abstract Object eval (Environment env);
  public abstract void compile (Compilation comp, int flags);
  ...
}

The abstract Expression class represents partially processed expressions. These are in principle independent of the source language, but in practice there are some Scheme assumptions.

The eval method evaluates the Expression in the given Environment. The interactive command interface uses eval to evaluate expressions typed by the user. However, eval only support "simple" expressions, such as literals, identifiers, and applications. Expressions that define new local bindings (such lambda expressions and let forms) do not implement eval. If the user types in such an expression, it is wrapped inside a dummy function, compiled, and immediately executed. This is to avoid dealing with lexical binding in the evaluator. (We could compile all user expressions, but that entails a certain amount of overhead. Code generation creates new classes, and JDK 1.0 does not garbage-collect unused classes.)

The compile method is called when we are compiling the body of a procedure. It is responsible for generating bytecodes that evaluate the expression, and leave the result on the Java evaluation stack.

class QuoteExp extends Expression {
Object value;
  public QuoteExp (Object val)
  { value = val; }
  public Object eval (Environment env)
  { return value; }
  public void compile (Compilation comp, int flags)
  { comp.compileConstant (value); }
  ...
}

A QuoteExp represents a literal (self-evaluating form), or a quoted form.

class ReferenceExp extends Expression {
  Symbol symbol;
  Declaration binding;
  ...
}

A ReferenceExp is a reference to a named variable. The symbol is the source form identifier. If binding is non-null, it is the lexical binding of the identifier.

class ApplyExp extends Expression {
  Expression func;
  Expression[] args;
  ...
}

An ApplyExp is an application of a procedure func to an argument list args.

class ScopeExp extends Expression {
  ScopeExp outer; // Surrounding scope.
  public Declaration add_decl (Symbol name)
  { Create new local variable. }
...
}

A ScopeExp is a abstract class that represents a lexical scoping construct. Concrete sub-classes are LetExp (used for a let binding form) and LambdaExp.

class LambdaExp extends ScopeExp {
  Symbol name; // Optional.
  Expression body;
  int min_args;
  int max_args;
  ...
}

The Scheme primitive syntax lambda is translated into a LambdaExp, which represents anonymous procedures. Each LambdaExp is compiled into a different bytecoded class. Invoking eval causes the LambdaExp to be compiled into a class, the class to be loaded, and instance of the class to be created, and the result coerced to a Procedure.

Other sub-classes of Expression are IfExp (used for conditional expressions); BeginExp (used for compound expressions); SetExp (used for assignments); and ErrorExp (used to mark code that has a syntax error);

Translation

The translation phase takes a top-level form (or body), and generates a ModuleExp, which is a top-level expression. This is done using a Translator, which keeps track of lexical bindings and other translation state.

class Translator {
  public Expression rewrite (Object exp)
  { ... }
  public Expression syntaxError (String message)
  { ... }
  ...
}

The rewrite method converts a Scheme source form to an Expression. The syntaxError method is called when a syntax error is seen. It prints out the current source filename and line number with the given message.

A ModuleExp represents a top-level form.

class ModuleExp extends LambdaExp {
  public Object eval_module (Environment env)
  {
    if (body_is_simple) // Optimization
      return body.eval (env);
    Object v = eval (env);
    return ((ModuleBody) v).run (env);
  }
  ...
}

ModuleExp is a sub-class of LambdaExp, since it is actually a dummy function created by wrapping the top-level forms in an implicit lambda. The eval_module method evaluates the top-level forms. It invokes the eval in LambdaExp (which invokes the compiler). The result of eval is a ModuleBody (see section Procedures), which we can run. If the body is "simple" we don't bother actually evaluating the ModuleExp, since that entails compiling it to a bytecoded class; we just eval the body instead.

Syntax and Macros

class Syntax {
  public abstract Expression rewrite (Object obj, Translator tr);
  ...
}

The rewrite method in Translator checks for syntactic keywords and macros. If the car of a "call" is a Syntax or if it is a Symbol that is bound to a Syntax, then its rewrite method is called.

As an example, this trivial class implements quote.

class quote extends Syntax {
  public Expression rewrite (Object obj, Translator tr)
  {
    // Error-checking is left out.
    Object value = ((Pair)obj).car;
    return new QuoteExp(value);
  }
  ...
}

Much more complicated is the Syntax that implements define-syntax.

class define_syntax extends Syntax {
  public Expression rewrite (Object obj, Translator tr)
  {
    enter (new SyntaxRules (...));
  }
  ...
}

The result is a SyntaxRules object, which contains an encoded representation of the patterns and templates in the syntax-rules. This is in its own right a Syntax object.

class SyntaxRules extends Syntax {
SyntaxRule[] rules;
  public Expression rewrite (Object obj, Translator tr)
  {
    Object[] v = new Object[maxVars];
    for (int i = 0;  i < rules.length;)
      {
        SyntaxRule r = rules[i++];
        if (r.match (obj, v))
          return r.execute_template(v, tr);
      }
    return tr.syntaxError ("no matching syntax-rule");
  }
  ...
}

Contrast evaluating a procedure definition (lambda), which causes a new sub-class of Procedure to be created and compiled, while evaluating a define-syntax only causes a new instance of SyntaxRules to be created. This is because the syntax-rules can be represented using relatively simple and compact data structures. A traditional low-level macro facility specifies the transformations using executable code, and that probably would need a new Procedure sub-class.

Code generation

A Compilation object keeps track of the classes, methods, and temporary state generated as a result of compiling a single top-level ModuleExp.

class Compilation {
  ClassType[] classes;
  boolean immediate;
  public ClassType addClass (LambdaExp lexp, String name)
  { ... }
  public ClassType (ModuleExp exp, ...)
  { ...; addClass (exp, ...); }
  ...
}

Each Compilation may create one or more ClassType objects, each of which generates the bytecodes for one class. Each ClassType is generated from a LambdaExp, including the top ModuleExp. The boolean immediate is true if we are compiling for immediate loading, and is false if the target is one or more .class files.

The addClass method does all the work to compile a given LambdaExp. It creates a ClassType, adds it to Compilation's classes array, and generates Method objects for the constructor and the main applyX method. Once the applyX Method has been created, addClass emits some bytecodes to set up the incoming parameters, and then invokes the virtual compile method on the body of the LambdaExp, which generates the code that does the actual work of the procedure.

The Compilation constructor gets a ModuleExp, which it passes to addClass. The compile method of LambdaExp (which gets called for all lambdas except the dummy top-level) also calls addClass to generate the class corresponding to the lambda, and then it emits instructions to create a new instance of the generated Procedure class, and pushes it on the Java stack.

The `codegen` package

The ClassType and Method classes are in a separate codegen package, which is an intermediate-level interface to code generation and Java .class files. It is essentially independent of Scheme or the rest of Kawa, and could be used for generating code for other languages.

class ClassType extends Type {
  CpoolEntry[] constant_pool;
  Method methods; // List of methods.
  Field fields; // List of fields.
  public Field new_field (String name, Type type, int flags)
  { Create new field. }
  public method new_method (String name, ...)
  { Create new method. }
  public void emit_to_stream (OutputStream stream)
  { ... }
  public void emit_to_file (String filename)
  { ... }
  public byte[] emit_to_array ()
  { ... }
  ...
}

The ClassType class is the main class of the codegen package. It manages a list Fields , a list of Methods, and the constant pool. There are utility methods for adding new fields, methods, and constant pool entries.

When the ClassType has been fully built, the emit_to_file method can be used to write out the contents into a file. The result has the format of a .class file T. Lindholm and F.~Yellin: The Java Virtual Machine Specification, Addison-Wesley, 1996. Alternatively, the class can be written to an internal byte array (that has the same layout as a .class file) using the emit_to_array method. The resulting byte array may be used by a ClassLoader to define a new class for immediate execution. Both of the these methods are implemented on top of the more general emit_to_stream.

The largest class in the codegen package is Method.

class Method {
  Variable new_local (Type type, String name)
  { ... }
  public void compile_push_value (Variable var)
  { ... }
  public void compile_push_int (int i)
  { ... }
  public void compile_linenumber (int linenumber)
  { ... }
  ...
}

As an example of the level of functionality, compile_push_int compiles code to push an integer i on stack. It selects the right instruction, and if i is too big for one of the instructions that take an inline value, it will create a constant pool entry for i, and push that.

The method new_local creates a new local variable (and makes sure debugging information is emitted for it), while compile_push_value pushes the value of the variable on the stack.

Kawa calls compile_linenumber to indicate that the current location corresponds to a given line number. These are emitted in the .class file, and most Java interpreters will use them when printing a stack trace.

Literals

A Scheme quoted form or self-evaluating form expands to a QuoteExp. Compiling a QuoteExp would seem a trivial exercise, but it is not. There is no way to embed (say) a list literal in Java code. Instead we create a static field in the top-level class for a each (different) QuoteExp in the body we are compiling. The code compiled for a QuoteExp then just needs to load the value from the corresponding static field. The tricky part is making sure that the static field gets initialized (when the top-level class is loaded) to the value of the quoted form.

The basic idea is that for:

(define (foo) '(3 . 4))

we compile:

class foo extends Procedure0 {
  Object static lit1;
  public foo ()
  { // Initializer
    lit1 = new Pair(IntNum.make(3), IntNum.make(4));
  }
  public Object apply0 ()
  { return lit1; }
}

When the compiled class foo is loaded, we do:

Class fooCl = Class.forName("foo");
Procedure fooPr = (Procedure) fooCl.newInstance ();

// Using foo:
Object result = fooPr.apply0 ();

How does the Kawa compiler generate the appropriate new Pair expression as shown above? A class whose instances may appear in a quoted form implements the Compilable interface:

interface Compilable {
  Literal makeLiteral (Compilation comp);
  void emit (Literal literal, Compilation comp);

The makeLiteral creates a Literal object that represents the value of this object. That Literal is later passed to emit, which emits bytecode instructions that (when evaluated) cause a value equal to this to be pushed on the Java evaluation stack.

This two-part protocol may be overkill, but it makes it possible to combine duplicate constants and it also supports circularly defined constants. (Standard Scheme does not support self-referential constants, but Common Lisp does. See section 25.1.4 Similarity of Constants in G. L. Steele Jr.: Common Lisp - The Language, Digital Press and Prentice-Hall, second edition, 1990.

It is likely that the Compilable interface will be replaced in the future with the serialization features of JDK 1.1.

If we are compiling for immediate execution, we do not need to generate code to regenerate the literal. In fact, we want to re-use the literal from the original source form. The problem is passing the source literal to the byte-compiled class. To do that, we use the CompiledProc interface.

interface CompiledProc {
  public abstract void setLiterals (Object[] values);
}

An immediate class compiled from a top-level form implements the CompiledProc form. After an instance of the ModuleBody has been created, it is coerced to a CompiledProc, and setLiterals is called. The argument to setLiterals is an array of the necessary literal values, and the method that implements it in the compiled code causes the array of literal values to be saved in the ModuleBody instance, so it can be accessed by the compiled code.

Low-level procedures

Using a named global Procedure object has a bit of overhead. We refer to it by name, and finding the right Procedure is done by a run time name lookup. Once we have the procedure, we do a virtual method call to apply the procedure, which is relatively fast, but we would like to be able to inline known functions. Also, in the future, we want to support type declarations and the use of machine types such as int. Having to convert an int to an IntNum just because that is what the apply interface requires is expensive.

Another problem is that being able to call a Java method from Scheme requires writing by hand a new Procedure subclass. This is not a friendly interface.

The plan is to provide new syntax for defining Scheme procedures in terms of Java methods. The syntax will look something like:

(define-virtual exec-process "java.lang.RunTime" "exec"
     "java.lang.Process" ("java.lang.String"))

(set! ls (exec-process "ls -l"))

The primitive syntax define-syntax defines a PrimitiveProcedure object (which inherits from PrcedureN), which refers to a named virtual method in a named class with specified parameter and return types. The primitives define-static and define-interface are similar, but for static and interface methods.

When the compile method of ApplyExp sees that the function is a PrimitiveProcedure, it will emit code to call the method. It will also emit code to coerce the arguments to the required types. It will not do any checking that the method actually exists and has declared type, since the method may not be available yet (and it is also a pain to do, though the JDK 1.1 reflection facility will make it easier).

The name declared by define-virtual has the same scope as that of a macro. It must be seen at compile-time. A PrimitiveProcedure cannot be applied or called at top-level interactively.

We can lift these restrictions when Kawa switches to JDK 1.1 and its reflective features. In that case, using a PrimitiveProcedure may be the preferred way to define many builtin functions. The advantage is simpler declarations, fewer standard classes, much faster invocation when the compiler knows which method is being called, but at the cost of somewhat slower calls when we have to use the general applyN interface, which would use the probably slower reflection facilities.

Continuations

Scheme continuations "capture" the current execution state. They can be implemented by copying the stack, but this requires non-portable native code. Kawa continuations are implemented using Java exceptions, and can be used to prematurely exit (throw), but not to implement co-routines (which should use threads anyway).

class callcc extends Procedure1 {
  public Object apply1 (Object arg1)
  {
    Procedure proc = (Procedure) arg1;
    Continuation cont = new Continuation ();
    try { return proc.apply1(cont); }
    catch (CalledContinuation ex)
      {
        if (ex.continuation != cont)
             throw ex;  // Re-throw.
        return ex.value;
      }
  }
  ...
}

This is the Procedure that implements call-with-current-continuation. It creates cont, which is the "current continuation", and passes it to the incoming proc. If callcc catches a CalledContinuation exception it means that proc invoked some Continuation. If it is "our" continuation, return the value passed to the continuation; otherwise re-throw it up the stack until we get a matching handler. (I have left out code to detect unsupported invocation of cont after callcc returns.)

class Continuation extends Procedure1 {
  public Object apply1 (Object arg1)
  {
    throw new CalledContinuation (arg1, this);
  }
  ...
}

A Continuation is the actual continuation object that is passed to callcc's argument; when it is invoked, it throws a CalledContinuation that contains the continuation and the value returned.

class CalledContinuation extends RuntimeException {
  Object value;
  Continuation continuation;
  CalledContinuation (Object value, Continuation continuation)
  {
    super ("call/cc called");
    this.value = value;
    this.continuation = continuation;
  }
  ...
}

CalledContinuation is the exception that is thrown when the continuation is invoked.

Tail-calls

Scheme requires that tail-calls be implemented without causing stack growth. This is difficult to do portably in Java (or for that matter in C), since implementing it efficiently requires low-level access to the hardware stack. There are tricks one can use (a function returns a pointer to the next function to be called, rather than calling it directly), but these are rather expensive, especially in Java (which does not have function pointers).

Compiler optimizations can re-write many tail-calls into gotos. The most important case is self-tail-calls or tail recursion. Kawa rewrites these when it can prove that is safe. It is rather conservative: it must be able to prove that the procedure binding cannot be re-assigned to some other value. Thus only letrec tail-recursion is supported, not tail-recursion of global procedures. This restriction may be excessively paranoid, but it is required by the language, and it is sufficient to optimize the standard do and named-let forms. A future version will provide a way to specify that Kawa can be less conservative (using either a module system or compiler switches).

While full tail-call elimination will probably never be supported by Kawa (except perhaps on modified Java interpreters), some forms of mutual tail-recursion can be eliminated with more sophisticated compiler analysis, though there are no concrete plans for that yet.

Classes, types, and declarations

An extensive type system that supports user-defined classes is important for extensibility and for integration with other Java packages. This section describes current ideas for future extensions.

The record extension proposed (and rejected) for R5RS will probably be added. It has been partially written; however a new implementation based on the JDK 1.1 reflective features seems preferable, since that means the record accessor features can be used on arbitrary Java objects. After than, the next step is to add methods, and we have a way to define and access Java classes from Scheme.

Type declarations and some form of first-class types are also planned. This will improve documentation, error-checking, and code efficiency. Among the types to be supported are "unboxed" number types, so the compiler could use raw double instead of having to allocate a DFloNum object. In addition to allowing better code, this feature will also make it it easier to integrate with primitive Java types. (The define-virtual syntax in section Low-level procedures will be extended to support type values.)

Some kind of module system is desirable. One reason is to control names and names clashes; another reason is so that the compiler can map global variables references to their definitions. This makes it easier to do better optimizations. It is tempting to define a module as a kind of class: A simple module is a packaging of data and function definitions, and it is easy to map these into static fields and methods of a class. Controlling which components are exported is similar to specifying visibility (public or private). What is more challenging is how to model import lists, or module signature/interfaces. Perhaps we can use Java interface types.

Given all this new infrastructure (types and modules), then the compiler will need some re-writing. At the very least it needs to know that expressions may have other types than just plain Object.

Current and Future Work

There is very preliminary threads support in Kawa. It provides an interface to Java threads that looks somewhat like delay, except that the delayed expression is evaluated in a new thread. (The model is similar to to the "futures" concept of MultiScheme J. S. Miller: MultiScheme: A Parallel Processing System based on MIT Scheme, PhD thesis, Department of Electrical Engineering and Computer Science, MIT, Aug. 1987. MIT-LCS//MIT/LCS/TR-402.)

but there is no implicit force, at least yet.) Recent re-implementation of core classes (such Environment and Translator) has been done to support threads with optionally separate top-level environments.

An interface to graphics primitives is needed, but no work has been done. Unfortunately, Java's Abstract Windowing Toolkit is commonly considered among the weaker parts of Java, and lacks important features, such as a decent multi-line text editor.

More sophisticated Scheme code rewriting, optimizations, and inlining should also be investigated after Kawa has been taught to know about types as discussed in section Classes, types, and declarations.

One important goal of Kawa is that it should support multiple languages. This includes an interface to plug in new parsers, pre-defined functions, data types, and output formatting. Of special interest is re-implementing some of the ideas and syntax from my earlier Q language. These include a line-oriented syntax with fewer parentheses, and high-level sequence and array operations (as in APL).

Also of interest is support for Emacs Lisp. This would require an extensive library to implement the Emacs data types (such as buffers and windows), in addition to the challenges of the Emacs Lisp language itself (it has different data types and name binding rules than Scheme), but may be a good way to build a next-generation Emacs.

Conclusion

Kawa is a solid implementation of Scheme with many features. It is portable to any environment that can run Java applications. It has active use and development. I have not bothered running benchmarks, because the state of the art in Java implementation is in such flux, because many optimizations are planned but have not been implemented, and because the different feature sets of the various Scheme implementations makes them difficult to compare fairly. But Kawa has the potential of being a reasonably fast Scheme implementation (though probably never among the very fastest), and will reap benefits from current efforts in Java compilation (see P.~Bothner: A Gcc-based Java implementation. IEEE Compcon 1997 Proceedings (to appear), 1997).

Biographical Information

Per Magnus Alfred Bothner studied at the University of Oslo, Norway, and received his Ph.D. in Computer Science from Stanford University in 1989. He then worked for Digital Equipment; University of Wisconsin; and has worked for Cygnus since 1991. He can be reached at <bothner@cygnus.com> or <bothner@gnu.bai.mit.edu>.

His highly unimpressive home page is http://www.cygnus.com/~bothner, but at least it tells you where to get Kawa.