Foundation Design Document

Foundation is an integrated package of writing tools.  The system
gives special emphasis on getting the tools to talk to each other and
on quickly propagating changing data through a large network of
interlocked data structures.  In writing tools, especially text
formatters, changes made in one part of a document have the potential
to affect the whole document.  It is vitally important to be able to
control the propagation of change so that the most important things
are done first, and unnecessary work is avoided.

Foundation serves as a development test bed for the development of new
writing tools.  Everything is written to be modular so that subsystems
may be modified or replaced without seriously affecting the rest of
the system.  To this end, the user command set is considered to be a
seperate module.  Each module is designed in terms of what functions
it provides leaving how those functions are called to the user
interface module.  The system supports explicit interfaces between
modules.  Any coupling between modules should happen through the
explicit interfaces.  This is an important step in minimizing the
impact changes in one module have on the system as a whole.

The most important issue in the implementation is complexity
management.  Both the user and the developer need to be able to
understand the system without strain.  Several rules of thumb are
being tried.  The primary one, which will be used throughout the
system and will be ruthlessly enforced is the "seven plus or minus
two" ( 7 +/- 2 ) rule.  Cognitive scientists say that human beings can
hold between five and nine seperate ideas in short term memory at a
time.  If this is true, then the system should be organized so as to
avoid taxing a human being's cognitive limitation.  Everything in
Foundation is organized in hierarchies with no more than nine
elements.  This may seem like a trivial point, but this particular
rule of thumb is a serious attempt to explicitly address complexity
management in a computer program, and in a user interface.

TWO MAJOR PARTS

Foundation comes in two parts, the support system, and the
applications which run on top of the support system.  Each
application, and the support system itself are considered major
components of the entire integrated Foundation system.  They are
independent in that they communicate through interfaces.  They are
integrated, in that they communicate extensively through their
interfaces.  A particular application should not know so much about
the operation of another application that it can control it, but it
should know what action to take if data it cares about is changed in
certain ways by another application.  A master / slave relationship is
too tight a coupling for modules.  Modules must remain autonomous.
Multiple committee members acting on changing data is the correct
paradigm here.

Here is a list of the functions provided by the support system:

  Memory Management
    Real time storage allocation/reclamation
    Blocks of space that grow and shrink
    Blocks of space that have rigid size
  Input / Output Interfacing
    Scheduling of input events and multiple input files and streams
    Display of text and graphics information (multiple fonts)
    Multiple output files and formats
  Task scheduling
    propagation of changes of data within and across applications
    abstractions will limit depth of changes while increasing breadth
    Scheduling of actions taken on changes in data
  Self monitoring
    Audit of input from streams
    Profiles of execution of the various modules

Here are a few of the applications to be implemented in Foundation:

  Editor - Text editor, screen oriented.
  Outliner - Tool for facilitating the process of outline.
  Free Writing Tutorial - Tutorial for helping break writer's block.
  Annotator - A way of adding marginal notes or other side annotations
    to a document which refer to the text but do not become part of it.
  Formatter - What you see is what you get integrated editing/formatting.

SUPPORT

Memory Management

	Caltech developed more efficient versions of the standard UNIX
system calls for malloc, realloc, and free.  We have taken those
routines and added some additional debugging modes to them.  Space is
requested by a calling malloc with the size of the block needed.  The
call returns with a pointer to a block of space that has AT LEAST that
much space.  When the allocated block needs to be enlarged, the
pointer is passed to realloc with a specification of the new size.  If
the block is already big enough, then the internal housekeeping
numbers are changed, and the pointer is just returned.  Otherwise, a
new block is allocated, and the old contents are copied.

	Blocks of storage are kept in buckets.  The raw storage is
acquired in chunks with a granularity of 1024 bytes.  This size
matches the machine's page size to efficiently interface with the
memory management hardware.  For smaller blocks than 1024 bytes the
chunks are subdivided.  For larger blocks, the space must be a
multiple of 1024 bytes.  The blocks are stored in caches.  There is
one cache for each size of space.  The sizes range from 8 bytes to
2^32 bytes.  Making space available to applications incurs a little
overhead.  At minimum it consists of a flag indicating if the space is
allocated or free, an index into the heap of caches, and a size
parameter.  Additionally it can contain magic words to catch accesses
that would be out of bounds, or counters to measure how many times the
memory was allocated and freed.  The amount of actual space available
is the raw size minus the overhead.

	A special debugging mode has been added to the storage
allocator.  Since blocks are allocated in a few fixed sizes, there is
often unused space at the end of a block that the user has asked for.
In the debugging mode, this space is filled with a known test pattern
at allocation time, and scanned at the times of reallocation or
freeing.  If the test pattern has been damaged, an error is reported.
The test pattern would be damaged any time the user changed data that
was un-asked for.  Ordinarily run-time bounds checking on block
accesses impairs efficiency too much.  With this debugging mode, it is
possible to determine if some routine is mis-behaving.

	Additionally, freed storage will get set to a known test
pattern in the debugging mode.  This will give even more opportunity
to notice accesses to memory that is out of bounds.

Design of the Internal Data Structures

	Foundation is fundamentally a text manipulation system.
Therefore the data structures are organized to do text manipulation
well.  At all points in the design, hooks have been left for when
graphics and bitmaps are also manipulated as part of a text.  The goal
is to support multiple viewpoints on easily accessible and changeable
chunks of data.

Elastic Blocks for Raw Data

	Elastic block is the name given to blocks of storage that grow
and shrink dynamically during their use.  Because the storage
allocator enlarges a block of space by allocating a new block and
copying the data, an elastic block is accessed indirectly through one
known point.  Any sharing takes place through that one point.  Anyone
who attempts to share an elastic block by copying the pointer to the
block will suffer painful and difficult to find errors.  That is
because, when the structure is re-sized the copy of the pointer will
point at a block on the free list instead of the allocated block.

	Data abstractions built on top of elastic blocks are designed
to act alike functionally.  Although they hold different types of
data, and are applied in very different ways, the functions used to
manipulate them are the same.

	All of the elastic block data abstractions have the following
functions:

Create	Create a new empty block of data.
	Takes no arguments, returns pointer to empty block.

Destroy	Free the block and forget about it.
	Takes pointer to block pointer.  Frees block and zeros the
	block pointer.

Size	Tell the current size of the data block.
	Takes pointer to block and returns integer size.

Find	Locate a pattern in the data.
	Takes pointer to block, pointer to pattern (which is also a
	block), offset to start from and direction to search.  Returns
	index where pattern begins, -1 if no match.

Delete	Remove some data and resize block smaller.
	Takes pointer to block pointer, offset into block where
	deletion should occur, and count to delete.  May change block
	pointer.  Returns -1 if attempt was made to delete more than
	all.

Widen	Enlarge a data block shifting data down if necessary.
	Takes pointer to block pointer, offset to begin opening new
	space, and count of space to add.  Returns -1 if offset is
	past the end of the block, or if the widen fails to get more
	space.  The hole is filled with zeros.

	Although it would be fairly simple for someone to do all these
operations by hand, they are provided by the system for all the
elastic blocks.  That way all the structures are guaranteed to
behave uniformly.  Also any special coupling between the storage
allocator and the functions can be handled in a well controlled way.
Finally, somebody has to write the functions, it may as well be done
right the first time.  Frequently the functions will be implemented in
macros which will violate the storage allocator abstraction in the
interests of increased efficiency.

	It was decided not to implement generic functions for
overwriting or accessing the dynamic data.  C provides a powerful and
efficient way of manipulating data through pointers.  That system
would be slowed down and made more complicated by any accessing
routines written.  The temptation not to use such routines was deemed
too great.  Therefore, accessing outside the block is a ghastly error!
There is the debugging mode for the storage allocator which will catch
these types of errors.  The important thing about the debugging mode
is that it can be turned off for users, and on for developers and
maintainers.

	Raw character data is stored in an array of characters (char *
in C).  This array is referred to with the equivalent name Char.  The
data is assumed to be byte sized and pointer offsets access the data
in bytes.  When an insertion is to take place that will enlarge the
array, the Widen routine is used and then the new text is added.

	There is an abstract data type called Array.  Note the
capitalization to distinguish it from the generic computer term array.
An Array is a pointer to a C language array that holds 32 bit
quantities which can be either pointers to other structures, or long
integers which can be used for such things as offsets into other
structures, or sets of constants.  Although it is harder to insert an
item into the middle of an Array than into a linked list, the access
is simpler and faster.  (Just add the integer offset to the pointer to
the first Array element.  C does pointer arithmetic very efficiently.)
A pointer offset accesses the elements in an Array in 32 bit long
words.  In an Array, any free space appears at the end.

	Elastic blocks for graphics information and for bitmaps
will be added later so that line drawings and scanned images can be
stored and manipulated.

Rigid Blocks for Everything Else

	Because of the tendency of pointers to elastic blocks to
become invalid after re-sizing operations, most of the rest of the
data abstractions are build on rigid blocks.  They are still created
and destroyed dynamically through the storage allocator, but they
won't change size, and they can be pointed at by numerous entities.

How Text works

	The storage of raw text in elastic blocks has already been
described.  To access and manipulate collections of raw text there are
two higher level abstractions: Text, and Region.

	The Text abstraction permits sharing of blocks of text.  A
Region is a mechanism for refering, as a unit, to a body of text that
may encompas parts from several Text blocks.  Regions can overlap.
Overlap makes things tricky when text is added to a block shared by
many regions.

	Here is the C language definition of the Text structure:

typedef struct text
{
  Char	text_chars;
  Array	text_regions;
} *Text;

	The text_chars component points at the raw data.  In
operations on the elastic blocks such as Delete or Widen, a pointer to
text_chars is passed.  This is the only place permitted to know the
actual address of raw characters.  All other accesses go through
indirecting through the Text structure.

	The text_regions component is an array which contains a
pointer to every region that uses text from this block.  In this way,
when the block changes, all the offsets recorded in a region can be
updated to reflect the correct new positions of the text.

How Regions Work

	The central data abstraction in Foundation is the Region.
Regions control the sharing of text, the derivation of text based on
other text, and the procedures that get called when text changes.
Regions hold the house keeping information for data storage and
propagation.  Multiple applications operating on common bodies of text
is supported through the Region abstraction.  At the lower level is
the Text.  When two applications wish to share text they each define a
Region that looks at some proper or improper subset of text that is
stored within Foundation.  Both Regions register in the text_regions
Array that they care about the text.  When an application modifies
text accessed through one Region, procedures in the other one will be
run to take stock of the change.  When an application wishes to derive
text from shared text, it makes a region and uses the region_derived
Array to point at text that it creates.  The underlying idea here is
that the derived Array will contain the old copy and procedures stored
in the region_ops array will run to create a new copy from the shared
text after the shared text has been changed.

	Here is the C definition of a Region:

typedef struct region
{
  int	region_type;
  Array region_parms;
  /* Tuple for elements in the region */
    Array region_texts;		/* Array of texts to scan */
    Array region_starts;
    Array region_ends;
    Array region_ops;
    Array region_derived;	/* Stuff derived from each text */
    Array region_d_key;		/* Back key on derivation */
  /* Keep the above arrays in Sync OR ELSE! */
} *Region;

	The three arrays region_texts, region_starts, and region_ends
define a triple which names a region of text within an elastic block.
The texts element points at the Text structure for it.  The starts
element gives an integer offset for the character that begins the
region.  The ends element gives the offset for the character that ends
the region.  It goes without saying that these three arrays MUST be
kept in sync with each other.

	By now the code necessary to insert text into a big buffer
referred to by many regions should be clear:  The Char array that
holds the raw text is widened to accept the new text.  The array of
text_regions is traversed.  For each region found in the array, the
region_texts is searched for a match on the Text pointer.  For each
Text pointer that matches, the starts and ends boundries are checked
and if they fall within the scope of the change, they are updated.
For boundries that start before the change, nothing happens.  For
those that end before the change nothing happens.  For those the begin
after the change, they are incremented by the number of characters
added.  For those that end after the change, they are incremented by
the number of characters added.  Text added before a region shifts the
virtual text down.  Text added into a region expands the region.  Text
added after a region has no effect.

	The region_ops Array is also kept in sync with the texts,
starts, and ends Arrays.  It contains a pointer to a C procedure to be
called if the contents within the delineated region changes.
Redisplay is an obvious application for this array.  When new text is
added to the region, a call to a procedure that compares the new
contents of the text block to the block that is on-screen and takes
appropriate action.  If there is a zero in the Array element, no
procedure is called.

	The region_derived and region_d_key arrays are for things that
are derived from the contents of shared text.  For example, a
formatted paragraph can be derived from raw unformatted text.  A
display can be derived from raw text.  A series of outline heading
numbers can be derived from raw text.  The derived data is stored in
the region_derived Array.  For each text section there is a
corresponding (but possibly null) derivation.  For times when a key
must associate with a derivation to enable access back to the shared
text, or to provide an additional way of accessing the derivation,
there is the region_d_key Array.  It too is kept in sync with the
other elements of the tuple.  The only restriction on what can be kept
as an element in the region_derived Array is that it be a 32 bit
quantity.  That means integers can be stored directly.  Pointers to
text can be stored directly, or pointers to structures can be.
Clearly this restriction is not seriously confining.

	Redisplay is an example of the use of the d_key array.  When
an exposure event occurs, the source text must be found corresponding
to a window rectangle.  The d_key array contains a key computed from
the rectangular area that the text covers.  When the given rectangle's
key matches, the text to display has been found.

	The region_parms Array contains any parameters specific to a
particular class of regions.  For example, a pointer to a name string
would be held there if the region permitted a symbolic name.  For
hierarchies of regions, there would be an integer declaring what level
of nesting this region was out of how many possible nests.

	The region_type integer is a simple numerical token that
identifies what class of region this is.  Given the region type, the
specific contents of the region_parms array can be determined.  Each
class of region will have a specific structure declared which will
give all the components of the parms array.  When that type of region
is encountered, a specific element in the array (such as the name) is
accessed in the following way:

typedef struct doc_parms
{
  char *doc_parm_name;
  char *doc_parm_type;
  int doc_parm_depth;
} *Doc_parms;

Doc_parms indirect;
Region	foo;
char	*name;

indirect = (Doc_parms)foo.region_parms;
name = indirect->doc_parm_name;

	In essence, the generic Array pointer is cast to a pointer to
the specific structure, and accessed following the structure template.


Input / Output Interfacing

	After memory management, the next lowest level of support
provided by Foundation is interfacing and scheduling for data input
and output.  Interfacing makes it easier for applications to access
outside data.  Scheduling enables multiple I/O streams to be handled
concurrently.

	There are three classes of I/O: Files, Streams, and Windows.
Files are just disk files that virtually every C programmer is well
familiar with.  Streams are similar to files.  They represent
connections to other processes.  In Berkeley 4.2 Unix and others
compatible with that implementation of inter-process communication, an
augmented file descriptor is a handle on a stream of data to other
processes.  These processes can be running on the local host or on
some other host on the network.  Windows are treated as a special case
of output.  Although a window, too, is accessed through a file
descriptor, the X window system supports asynchronous events, and
multiple windows and multiple client processes talking to the same
display device.  Additional housekeeping is done within Foundation to
make it easier to get bits onto the screen where you want them.

	File interfacing and scheduling is well understood.  With the
file descriptor in hand, you read until the end of file, or you write
until your buffer is empty.

	Stream interfacing and scheduling is a little tricky.  As with
files, you can write until your buffer is empty.  Reading is the
tricky part.  An end of file on a stream occurs when the other process
decides it doesn't want to ever talk on the channel anymore.  Before
that time, the process may well halt its stream of data for an
indefinite length of time while it figures out what to say next.  Our
process cannot be permitted to block while it is waiting for the other
process to continue transmitting.  Therefore for streams the select
system call is used before any read is attempted to make sure there
are data to read on the file descriptor.  One additional complication
is that while there are input and output file descriptors, stream
descriptors are read/write.  This may mean a little more housekeeping.

	Finally windows.  In addition to the file descriptor on the
window kept in a central place, there will be a central table of
windows and what routines are responsible for them.  Various
applications will want to control redisplay in different ways.  This
will be possible.  Eventually multiple windows, multiple fonts, and
multiple re-display algorithms will be available.  The mouse is
supported through the window system as a window event.  Certain low
level keyboard operations are also supported.  It is not clear what
will be done with keyboard events.  But the mouse will definitely be
interfaced.

	The window system also has support for menus.  The first cut
of a menu oriented command processor will interface to the XMenu
system.  Further experimentation will suggest changes.

Redisplay:

	Redisplay is broken down into two cases.  In the first case,
when a window or part of one is exposed, there is an exposure event
that comes in from the X window system.  In that case, the newly
exposed part of the screen must be re-displayed by the program.  In
the second case, text changes and the contents of the screen must be
changed.

	For redisplay, the following data structures exist:

Array windows;
Array display_list;

	The window Array contains an entry for every window that
Foundation knows about.  When a window event occurs, the window
structure for the given window is found in this array.

	The display_list contains a pointers to Rectangle structures.
A Rectangle structure defines what appears within a window in a
rectangular shaped area.  The intelligence of the redisplay is to most
efficiently find the smallest rectangle which changed, and come up
with the shortest set of commands to update it in reasonable time.
Rectangles are defined so that when an X event comes in, the reported
region is easily translated to a region within 0 or more rectangles in
the display list for a particular window.  There may need to be a
special 'magic' rectangle which contains whitespace for those areas
that nothing has been drawn into.

	The text to be displayed lives in a Region derived from text
stored in the system.  How that text got there is none of the
re-display system's concern.  All that is necessary is that the
display Region procedures be scheduled when text it cares about
changes.  The text that is actually sent out to X is stored in the
region_derived area.  When the text which is shared with the
applications changes, the old actual text is used to compute the size
of the output area.  If the new output area is different, text will be
moved.  Then the new output text is inserted into the derived area,
and the text is sent to the window along with apropriate positioning,
size and font information.

	The region d_key Array for a displayed region holds a key
which can be quickly compared against a region reported by an exposure
event.  When the event comes in the display list for the correct
window is searched for the right rectangle, and the rectangle in the
display list gives the region of displayed text and a key to use in
the region_d_key Array to find the particular body of text to
re-display.

	To summarize: The exposure case gives a window number and a
rectangle.  The window array gives the display list.  The display list
matches the rectangular area that changed and gives a region and a
key.  The key gives the text to re-display.

  window -> display list -> rectangles -> derived text -> text
  Find out what text is in 'this' rectangle and display it again.


	The change of text case gives a region and a text pointer for
the characters (later other things like bits and graphics) that
changed.  The text names the particular display region as caring about
that array of text.  The display region text area is matched with its
associated derived_text area which actually appeared on the screen.
(Note that text that changed that is not on screen has no display
region!)  The derrived text and the new shared text is utilized by the
redisplay operation to compute new text to output to the window.

    text -> derived text -> rectangle -> display list -> window
    Find out if there is an on-screen rectangle for this text that
    changed.  If so, recompute size, move text around it to fit again,
    and display new.


General I/O Conventions:

	Each module performing I/O should adopt the following
conventions: It should allocate a block of storage for buffer space.
All reads and writes should move data into or out of this block of
storage.  Then routines can manipulate the data.  Ordinarily a rigid
block of storage should be allocated, and read data would be copied
and transformed as it goes into the text buffers.  On rare occasions
reads can move data right into the text elastic blocks.  The module
should request a slot in the global file descriptor table when doing
opening a file.  It should then return the slot when the file is
closed.  All input from streams should be done through the system
input scheduler, and should be non-blocking.  Finally hooks should be
left so that transcripts of input and output can be kept.

	At this point the following library routines should be
written:

	Copy from buffer into region.  Creating new text blocks on the
	  fly.
	Copy from region to buffer performing transformations.
	File, stream, window open/close operations
	Transcript capability.
	Scheduler.

<Design global I/O housekeeping structure>
<Design global Window housekeeping structure>
<Define primitives for I/O: stream scanner, window munger, mouse>
<Associations of buffer space with file descriptors with ops.>

Task scheduling


	For internal support, the cornerstone of Foundation is the
task queue.  Some tasks can be deleted from the queue if another of
the same type comes in for the same region.  Others must be completed.
If the TASK_DELETE flag is set it can be deleted.  Within the queue
are pointers to Task structures.  Within the Task structure are
descriptions of what procedure to run, what type of task is enqueued,
what region the task is being run over, and control flags.

<ARGS?  Try and use the cleverest hack so any procedure can be called
and scheduled.>

/* Task Flags */

#define TASK_NONE 0
#define TASK_DELETE 1


typedef struct task
{
  int	task_type;
  int	task_flags;
  int	task_priority;
  Region task_region;
  int	(*task_proc)();	/* Procedure pointer */
} *Task;

	The task queue, together with the generality of the Region
data abstraction, make a simple, fast, powerful, and effective way of
scheduling the handling of data propagation and task scheduling within
Foundation.  Some amount of the onus is placed on the author of the
routines that are executed.  As the task of writing such routines is
better understood, tools will be added to make common operations
easier to implement.  The queue is used in coordination with the
regions so that the procedures associated with a block of text are
scheduled and run in a prescribed order.  For that reason there is a
priority word in the Task structure.

<Add scanning feature to enqueue and permit deletion from middle.>
<Decide what, if any, use is to be made of priorities.>

Self monitoring

	Given regions and the task queue.  Auditing becomes much
simpler.  Hooks are installed in the scheduling routines to optionally
keep track of what routines are scheduled, deleted, executed.  There
are hooks in the storage allocator to watch over memory consumption.
Routines can be so written that a complete record of all insertions
and deletions are recorded in yet another region.

	Audit of input and output is a little harder.  Special hooks
will have to be added.  Turning on this facility is expected to slow
the system down, since an extra amount of copying will take place.

	Some of the monitoring functions have dubious utility when run
at the same time.  For example, if all insertions and deletions are
being recorded, the report of memory consumption will be seriously
afffected by monitoring and will not necessarily show how the
application consumes memory by itself.  Some of these cases can be
taken care of through clever application of derivations.


Summary

	This section has defined the system support functions.  It is
clear that the system supports useful things.  Examples have been
given of how to use the system support functions.  The idea of Region
and of task queue are particularly useful and powerful in the text
manipulation environment.  These functions have been designed to place
small as possible constraint on the applications.  There should be
very little coupling between modules except that which is explicitly
enabled through these interfaces.