Byte-compilation of Scheme using Java byte-codes

May 1996

(This design document discusses the issues of implementing Java in Scheme. It predates the implementation, and does not necessarily match the current Kawa implementation.)

Our biggest outstanding deliverable for Guile is a byte-compiler and -interpreter. Using Java has been suggested, and there is wide-spread interest in using Java byte codes for other languages. For example, InterMetrics is porting their Ada95 compiler to emit Java byte-codes. See here for more languages that use Java.

There is already an implementation of Scheme on top of Java. This is "Kawa", written by R. Alexander Milowski. There are some major problems with Kawa 0.1, and it needs a lot of re-writing, but it may be a useful starting point.

There are some different possible targets for Scheme-in-Java:

A Scheme byte-code compiler with support libraries that will run on any Java implementation. The support libraries will implement the Scheme primitives. For maximum portability, it should be written in Java, but it is also possible to use "native" methods (i.e. written in C).
The same, but we target a specific Java implementation that has some extensions to support Scheme better.
A byte-code compiler that targets a Java interpreter that is implemented on top of a Scheme system. For example, we could target the (very incomplete) "Latte" sub-system of Guile.
A hybrid system: Depending on compiler switches, it could emit either portable bytecodes or Scheme-enhanced bytecodes. The latter might be either more efficient or provide features not available in the portable Scheme-in-Java implementation.

Related design issues include:

Execution efficiency.
Class file size and loading speed.
Need for general tail-call-elimination. If general tail-call-elimination is required, then it is not possible to produce portable bytecodes with a direct translation, though it is possible to use Henry Baker's trick (a function returns its continuation to its caller), which is also used in RScheme.
Need for full or restricted continuations. It is not possible to implement general call/cc using direct translation to portable bytecodes. It is possible if you re-write to continuation-passing style. Restricted call/cc that only returns to a caller is possible.

Data types

One issue how to represent Scheme values using Java types. Because of Scheme's dynamic typing, we represent Scheme values using Java object references. Scheme objects inherit from the Object class, like all Java objects. One might add an intermediate class Scheme_Object, and have all Scheme types be derived from it. That is possible, but I don't think it is a good idea. I think it is better to re-use existing Java classes. For example the Scheme values #f and #t should be the Java objects Boolean.FALSE and Boolean.True. This does imply that one cannot use that standard Java print functions (since they would print in Java syntax, rather than Scheme syntax).

Traditionally, Scheme (and Lisp) implementations represent objects using a short union, with special bit "tags" to distinguish heap objects and immediate values (such as fixnums). This allows some operations (including fixnum arithmetic) to be more efficient. Unfortunately, there is no way to do this with Java bytecodes. Olin Shivers proposes a Java extension to allow tagged types. I don't think this has much chance of being standardized, and I think pre-allocation of common values (characters and small ints) can give some of the benefits without changing the VM.

Here are my suggestions for representing the various Scheme types:

#f and #t

Use the standard Java class Boolean.

characters

Use standard Java class Character. Two minor problems:

Java only supports 16-bit Unicode characters. It is probably possible to make a Java system that can support more characters, while still supporting standard Java bytecode files.
Character values would not compare eq?, unless we provide a system-wide "obarray" table for characters. Luckily, Scheme does not require characters that are equal? to be eq?.

numbers

Use the various standard Java classes. It may be reasonable to statically allocate say the integers -1 ... 100. Alternatively, one could use a global (weak-?) hash-table to ensure equal? integers compare eq?.

pairs

Add a new Pair class.

()

We could use the Java NIL value, or some dummy object.

symbols

Use the Java String class. Use String.intern() to make symbols unique.

strings

Use the standard Java StringBuffer class.

vectors

Use a Java array of Object.

procedures/closures

We have to use special objects. This merits further discussion (below).

Procedures

Most Scheme functions are fairly straight-forward to compile to Java byte-codes. However, there are some complications:

Scheme functions are first class values, that can be passed around as objects.
Nested functions may require closures to be created.
Tail-calls are required to not cause stack growth.
First-class continuations may require the stack state to copied.

Java does not have function values. They can be simulated with a new Procedure class with a (virtual) apply method. Each function gets implemented as a separate sub-class of Procedure, and the Scheme code of the function is compiled into an apply method for that Procedure sub-class.

In some cases, a number of functions can use the same Procedure sub-class. For example, the c[ad]*r functions could be implemented with a single Procedure sub-class, with a field containing an encoding of the [ad]*. The apply method uses that field to do the right thing.

A closure requires a Procedure sub-class with a field that contains the static context. The static context can be an array, though it can also be an Object of some suitable class. Evaluating the nested lambda expression will allocate a context object, and then "new" the Procedure sub-class, passing it the context object. The apply method well get non-local bindings using the context object.

Tail call elimination

The most important use of tail-calls is probably tail-recursion, where a procedure calls itself (by name). This is easily compiled using a Java GOTO byte-operation. (This is more difficult if translating to Java source code, since Java does not have a goto statement, though Java byte-codes do.)

Mutual tail-calls among multiple functions compiled together can be handled by compiling all the functions in one big procedure, and and starting it with a switch operation to jump to the code for the desired function. In that case, a tail-call to one of the other functions can easily be compiled as a GOTO. It is not clear whether this is worthwhile, given that it still does not get you general tail-call elimination.

General tail-call elimination can be implemented using Baker's trick (a function returns the address of the next function to call), but this constrains you calling convention strongly, and for various reasons is messy and undesirable in a Java context.

The ideal way to support general tail-call elimination is to just do it in the Java interpreter (either for all tail-calls or using a new operation). The more optimized and tuned the interpreter is, the more likely is that that this will be tricky and machine-dependent, since it may involve messy adjustments to the C stack. It has been proposed that gcc be enhanced to support a new (extra) calling convention such that the called function be responsible for popping the arguments of the stack (rather than the caller). C programs could specify an __attribute__ to indicate that a function should use the new calling convention (thus regular functions can call functions using the new convention, and vice versa). If the Java interpreter is carefully written to use the new calling convention where appropriate, then it should be fairly easy to implement general tail-call elimination. If Scheme code is compiled to native (C or assembler), it can also use the new calling convention, thus getting full tail-call elimination. The FSF (i.e. Stallman) is favorable to this approach, which would be quite a bit of work, but would be generally useful to anyone who needs general tail-call elimination.

Continuations

Implementating continuations (specifically the call/cc function) may be straight-forward in some environments. However, it cannot be implemented in Java, nor can it be implemented in portable C code.

One option is to not worry about fully implementation continuations. Most uses for continuations are to implement exception handling, threads, and back-tracking. Exception handling and threads are already provided by Java. Continuations used to implement (non-resuming) exception handling are a special case of continuations that only return to stack frames in the current call chain; these can be implemented using a combination of Continuation objects and exception handling. Guile already supports threads (not implemented using continuations), as does Java; using continuations to switch threads in such an environment is probably a bad idea. We do not have a ready interface for back-tracking, except by using continuations, but I don't think many Scheme applications use back-tracking.

If our Java-byte-code interpreter is embedded in a Scheme system that already supports continuations, then it is trivial to implement call/cc.

An alternative is to have the compiler transform a Scheme program to a form which does not need explicit continuations. One option is to translate to continuation-passing-style. I don't understand this will enough to comment on the implications.

Jeffrey Mark Siskind is working on a highly optimizing compiler to translate Scheme to C. It could possibly be re-targeted to emit Java byte-codes. However, it is written for for a "closed-world assumption" with global analysis of the whole program. There is as yet no support for separate compilation of independent modules, though that is supposedly planned. In any case, if highly-optimized Scheme code is important, there are probably ideas (and code) we can use in Stalin. Also, William Clinger has expressed plans to re-target his Twobit optimizing compiler to emit Java byte-codes, and he believes it should be easy to do.

File format

A Java .class file contains more than just code. It also includes symbol-table information, constants, and other meta-data. We need to consider how this works for Scheme. One point to note is that a Java .class file contains the compilation of only a single class. I believe it is possible to compile a Scheme source file (or module) and only write a single .class file. However, that would restrict your implementation choices a bit, and otherwise warp the resulting structure, so it may be prefereable to generate multiple classes. This will require the use of multiple "helper" .class files, which is clumsy. It may be reasonable to extend the .class format to allow nested classes.

Constants

Scheme code includes constants of two kinds: Simple literals, and compound quoted S-expressions. Simple literals occurr in other languages (including Java), and are handled similarly:

Standard constants (such as #f are compiled into references to static members of standard classes (such as Boolean.FALSE).
Quoted Symbols can be compiled the same way the Java compiler handles string literals.
Strings and vectors be be compiled into references to static fields that get initialized at class-loading-time.
Character and integer literals can also be handled using references to static fields initialized at load time. Unboxed numbers (if used) can be compiled in the same way the Java compiler handles number literals. An extended Java interpreter can use other methods.

Quoted S-expressions can be evaluated at class-loading-time, and a reference to the result stored in a class-static field.

Symbol table information

Loading a compiled Scheme program requires that definitions get elaborated, and inserted into the Scheme global symbol table. This can be done by the class's static initializer method, which gets evaluated at class-loading time. No special magic is needed in the .class file to declare the various functions and variables. However, if one has a system that does separate compilation with a module system, it may be desirable to extract type and symbol information from a previously-compiled module. This requires at least some conventions for how to access symbol information.

Eval

Implementing eval, load, and a read-eval-print loop can be done in various ways.

If the bytecode interpreter is inside a Scheme interpreter, then the latter presumably already has eval. We can use that. We would need gateways between the different interpreters, so different kinds of functions can call each other. To do that, the calling convention for Java methods should be based on the already existing Scheme calling convention. Even if we do a Scheme-in-Java system from scratch, it is not very difficult to write a simple interpreter for eval.

Alternatively, we can use bytecode for interpreted code too: Code that needs to be evaluated is compiled to bytecodes, and then immediately loaded and executed. Java provides a methods to load a class from a bytearray in memory, so it is not necessary to write out a temporary file. One problem is that Sun's current Java release does not garbage collect classes, so code that results from eval is never freed. This should seldom be a problem, except possibly in a long-running interactive session, since well-written Scheme code seldom uses eval. The problem can be reduced if we only byte-compile function definitions, but immediately evaluate other top-level expressions. And of course one can use a Java interpreter that does garbage collect classes.

The bytecode-based RScheme system implements eval by compiling and immediately executing bytecodes. It has support for multiple bytecode interpreters, and has immediate ("raw" or "unboxed") numbers, so it would probably be a strong candidate for a combined Scheme/Java system.

Finally, it does not take much code to write a simple Scheme interpreter, This can be byte compiled, and linked in with the rest of the system. This makes most sense in a Scheme system that emphasises large applications, and uses a compiler that does global optimization, such as Stalin.