4.2BSD Networking Implementation Notes

                     Revised July, 1983


     Samuel J. Leffler, William N. Joy, Robert S. Fabry

              Computer Systems Research Group
                 Computer Science Division
 Department of Electrical Engineering and Computer Science
             University of California, Berkeley
                    Berkeley, CA  94720

                       (415) 642-7780


                          _A_B_S_T_R_A_C_T

          This report describes the internal structure
     of the networking facilities developed for the
     4.2BSD version of the UNIX* operating system for
     the VAX†.  These facilities are based on several
     central abstractions which structure the external
     (user) view of network communication as well as
     the internal (system) implementation.

          The report documents the internal structure
     of the networking system.  The ``4.2BSD System
     Manual'' provides a description of the user
     interface to the networking facilities.


_________________________
* UNIX is a trademark of Bell Laboratories.
† DEC, VAX, DECnet, and UNIBUS are trademarks of Digital
Equipment Corporation.


Networking Implementation  - i -                    Contents


                     _T_A_B_L_E _O_F _C_O_N_T_E_N_T_S


_1.  _I_n_t_r_o_d_u_c_t_i_o_n

9_2.  _O_v_e_r_v_i_e_w

9_3.  _G_o_a_l_s

9_4.  _I_n_t_e_r_n_a_l _a_d_d_r_e_s_s _r_e_p_r_e_s_e_n_t_a_t_i_o_n

9_5.  _M_e_m_o_r_y _m_a_n_a_g_e_m_e_n_t

9_6.  _I_n_t_e_r_n_a_l _l_a_y_e_r_i_n_g
 .1.    Socket layer
 .1.1.    Socket state
 .1.2.    Socket data queues
 .1.3.    Socket connection queueing
 .2.    Protocol layer(s)
 .3.    Network-interface layer
 .3.1.    UNIBUS interfaces

9_7.  _S_o_c_k_e_t/_p_r_o_t_o_c_o_l _i_n_t_e_r_f_a_c_e

9_8.  _P_r_o_t_o_c_o_l/_p_r_o_t_o_c_o_l _i_n_t_e_r_f_a_c_e
 .1.     pr_output
 .2.     pr_input
 .3.     pr_ctlinput
 .4.     pr_ctloutput

9_9.  _P_r_o_t_o_c_o_l/_n_e_t_w_o_r_k-_i_n_t_e_r_f_a_c_e _i_n_t_e_r_f_a_c_e
 .1.     Packet transmission
 .2.     Packet reception

9_1_0. _G_a_t_e_w_a_y_s _a_n_d _r_o_u_t_i_n_g _i_s_s_u_e_s
 .1.     Routing tables
 .2.     Routing table interface
 .3.     User level routing policies

9_1_1. _R_a_w _s_o_c_k_e_t_s
 .1.     Control blocks
 .2.     Input processing
 .3.     Output processing

9_1_2. _B_u_f_f_e_r_i_n_g _a_n_d _c_o_n_g_e_s_t_i_o_n _c_o_n_t_r_o_l
 .1.     Memory management
 .2.     Protocol buffering policies
 .3.     Queue limiting
 .4.     Packet forwarding


9


Networking Implementation  - ii -                   Contents


_1_3. _O_u_t _o_f _b_a_n_d _d_a_t_a

9_1_4. _T_r_a_i_l_e_r _p_r_o_t_o_c_o_l_s

9_A_c_k_n_o_w_l_e_d_g_e_m_e_n_t_s

9_R_e_f_e_r_e_n_c_e_s


9


_1.  _I_n_t_r_o_d_u_c_t_i_o_n

     This report describes the internal structure of facili-
ties  added to the 4.2BSD version of the UNIX operating sys-
tem for the VAX.  The system facilities  provide  a  uniform
user  interface to networking within UNIX.  In addition, the
implementation introduces a structure for network communica-
tions which may be used by system implementors in adding new
networking facilities.  The internal structure is not visible
to the user; rather, it is intended to aid implementors
of communication protocols and network services by providing
a framework which promotes code sharing and minimizes
implementation effort.

     The reader is expected to be familiar with the C
programming language and system interface, as described in the
4.2BSD System Manual [Joy82a].  Basic understanding of
network communication concepts is assumed; where required, any
additional ideas are introduced.

     The remainder of this document provides  a  description
of the system internals, avoiding, when possible, those por-
tions which are utilized only by the interprocess communica-
tion facilities.


_2.  _O_v_e_r_v_i_e_w

     If we consider the International Standards
Organization's (ISO) Open System Interconnection (OSI) model
of network communication [ISO81] [Zimmermann80], the
networking facilities described here correspond to a portion of
the session layer (layer 5) and all of the transport and
network layers (layers 4 and 3, respectively).

     The network  layer  provides  possibly  imperfect  data
transport   services   with  minimal  addressing  structure.
Addressing at this level is  normally  host  to  host,  with
implicit  or  explicit  routing  optionally supported by the
communicating agents.

     At the transport layer the notions of reliable
transfer, data sequencing, flow control, and service
addressing are normally included.  Reliability is usually
managed by explicit acknowledgement of data delivered.
Failure to acknowledge a transfer results in retransmission
of the data.  Sequencing may be handled by tagging each
message handed to the network layer with a sequence number and
maintaining state at the endpoints of communication to
utilize received sequence numbers in reordering data which
arrives out of order.

     The session  layer  facilities  may  provide  forms  of
addressing  which  are  mapped  into formats required by the
transport layer, service authentication and client authenti-
cation,  etc.  Various systems also provide services such as
data encryption and address and protocol translation.

     The following sections begin by describing some of  the
common  data  structures  and utility routines, then examine
the internal layering.  The contents of each layer  and  its
interface  are  considered.   Certain  of the interfaces are
protocol implementation specific.  For these cases  examples
have  been drawn from the Internet [Cerf78] protocol family.
Later sections cover routing issues, the design of  the  raw
socket interface and other miscellaneous topics.


_3.  _G_o_a_l_s

     The networking system was designed with the goal of
supporting multiple protocol families and addressing styles.
This required information to be ``hidden'' in common data
structures which could be manipulated by all the pieces of
the system, but which required interpretation only by the
protocols which ``controlled'' it.  The system described
here attempts to limit the use of shared data structures
to those kept by a suite of protocols (a protocol family),
and those used for rendezvous between ``synchronous'' and
``asynchronous'' portions of the system (e.g. queues of data
packets are filled at interrupt time and emptied based on
user requests).

     A major goal of the system was to provide a framework
within which new protocols and hardware could easily be
supported.  To this end, a great deal of effort has been
expended to create utility routines which hide many of the
more complex and/or hardware dependent chores of networking.
Later sections describe the utility routines and the
underlying data structures they manipulate.


_4.  _I_n_t_e_r_n_a_l _a_d_d_r_e_s_s _r_e_p_r_e_s_e_n_t_a_t_i_o_n

     Common to all portions of the system are two data
structures.  These structures are used to represent
addresses and various data objects.  Internally, addresses
are described by the sockaddr structure,

        struct sockaddr {
               short     sa_family;           /* data format identifier */
               char      sa_data[14];         /* address */
        };

All addresses belong to one or more address families which
define their format and interpretation.  The sa_family field
indicates which address family the address belongs to; the
sa_data field contains the actual data value.  The size of
the data field, 14 bytes, was selected based on a study of
current address formats*.
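
     As an illustration of how a family-specific address can
share the 16 bytes of the generic structure, the following
sketch overlays an Internet-style layout on sa_data.  The names
xsockaddr, xsockaddr_in, XAF_INET and the field layout are
assumptions made for this example only; they are not the
system's definitions.

```c
#include <assert.h>
#include <string.h>

/* Generic address as in the text; renamed so it cannot clash
 * with a system header. */
struct xsockaddr {
    short sa_family;            /* data format identifier */
    char  sa_data[14];          /* address */
};

#define XAF_INET 2              /* hypothetical family number */

/* Hypothetical family-specific layout occupying the same 16 bytes. */
struct xsockaddr_in {
    short          sin_family;  /* overlays sa_family */
    unsigned short sin_port;    /* first 2 bytes of sa_data */
    unsigned char  sin_addr[4]; /* next 4 bytes of sa_data */
    char           sin_zero[8]; /* unused remainder */
};

/* Build a generic address from family-specific parts. */
static void xmake_inet(struct xsockaddr *sa, unsigned short port,
                       const unsigned char ip[4])
{
    struct xsockaddr_in sin;

    assert(sizeof(sin) == sizeof(*sa));
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = XAF_INET;
    sin.sin_port = port;
    memcpy(sin.sin_addr, ip, sizeof(sin.sin_addr));
    memcpy(sa, &sin, sizeof(*sa));
}

/* Family-independent code looks only at the tag. */
static int xaddr_family(const struct xsockaddr *sa)
{
    return sa->sa_family;
}
```

Code that merely passes addresses around needs only
xaddr_family(); the contents of sa_data are interpreted solely
by the family that owns the address.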


_________________________
* Later versions of the system support variable  length
addresses.


_5.  _M_e_m_o_r_y _m_a_n_a_g_e_m_e_n_t

     A single mechanism is used for data storage: memory
buffers, or mbufs.  An mbuf is a structure of the form:

        struct mbuf {
               struct    mbuf *m_next;        /* next buffer in chain */
               u_long    m_off;               /* offset of data */
               short     m_len;               /* amount of data in this mbuf */
               short     m_type;              /* mbuf type (accounting) */
               u_char    m_dat[MLEN];         /* data storage */
               struct    mbuf *m_act;         /* link in higher-level mbuf list */
        };

The _m__n_e_x_t field is used to chain mbufs together  on  linked
lists,  while  the  _m__a_c_t  field allows lists of mbufs to be
accumulated.  By convention, the mbufs common  to  a  single
object (for example, a packet) are chained together with the
_m__n_e_x_t field, while groups of objects  are  linked  via  the
_m__a_c_t field (possibly when in a queue).

     Each mbuf has a small data area for storing
information, m_dat.  The m_len field indicates the amount of data,
while the m_off field is an offset to the beginning of the
data from the base of the mbuf.  Thus, for example, the
macro mtod, which converts a pointer to an mbuf to a pointer
to the data stored in the mbuf, has the form

        #define mtod(x,t)         ((t)((int)(x) + (x)->m_off))

(note the _t parameter, a C type cast, is used  to  cast  the
resultant pointer for proper assignment).
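
     A minimal sketch of the offset arithmetic follows.  It
uses a renamed structure (xmbuf) and a variant of the macro
(xmtod) that does byte-pointer arithmetic rather than an int
cast, so that it also works where pointers are wider than int.
The helper xm_put and the constant XMMINOFF are hypothetical
names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XMLEN 112                       /* data bytes per mbuf, per the text */

struct xmbuf {
    struct xmbuf *m_next;               /* next buffer in chain */
    unsigned long m_off;                /* offset of data from base of mbuf */
    short         m_len;                /* amount of data in this mbuf */
    short         m_type;               /* mbuf type (accounting) */
    unsigned char m_dat[XMLEN];         /* data storage */
    struct xmbuf *m_act;                /* link in higher-level mbuf list */
};

/* Offset of the data area itself: m_off for a freshly filled mbuf. */
#define XMMINOFF  offsetof(struct xmbuf, m_dat)

/* Same idea as mtod: base address plus m_off, cast to type t. */
#define xmtod(m, t)  ((t)((char *)(m) + (m)->m_off))

/* Copy len bytes to the start of the data area. */
static int xm_put(struct xmbuf *m, const void *src, short len)
{
    assert(m != NULL);
    if (len < 0 || len > XMLEN)
        return -1;
    m->m_off = XMMINOFF;
    m->m_len = len;
    memcpy(xmtod(m, char *), src, (size_t)len);
    return 0;
}
```

Advancing m_off (and reducing m_len by the same amount) skips
leading bytes without moving any data, which is how a protocol
can step past a header it has already processed.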

     In addition to storing data directly in the mbuf's data
area, data of page size may also be stored in a separate
area of memory.  The mbuf utility routines maintain a pool
of pages for this purpose and manipulate a private page map
for such pages.  The virtual addresses of these data pages
precede those of mbufs, so when pages of data are separated
from an mbuf, the mbuf data offset is a negative value.  An
array of reference counts on pages is also maintained so
that copies of pages may be made without core to core
copying (copies are created simply by duplicating the relevant
page table entries in the data page map and incrementing the
associated reference counts for the pages).  Separate data
pages are currently used only when copying data from a user
process into the kernel, and when bringing data in at the
hardware level.  Routines which manipulate mbufs are not
normally aware if data is stored directly in the mbuf data
array, or if it is kept in separate pages.

     The following utility routines are available for  mani-
pulating mbuf chains:


m = m_copy(m0, off, len);
     The m_copy routine creates a copy of all, or part, of a
     list of the mbufs in m0.  Len bytes of data, starting
     off bytes from the front of the chain, are copied.
     Where possible, reference counts on pages are used
     instead of core to core copies.  The original mbuf
     chain must have at least off + len bytes of data.  If
     len is specified as M_COPYALL, all the data present,
     offset as before, is copied.

m_cat(m, n);
     The mbuf chain, n, is appended to the end of m.  Where
     possible, compaction is performed.

m_adj(m, diff);
     The mbuf chain, m, is adjusted in size by diff bytes.
     If diff is non-negative, diff bytes are shaved off the
     front of the mbuf chain.  If diff is negative, the
     alteration is performed from back to front.  No space
     is reclaimed in this operation; alterations are
     accomplished by changing the m_len and m_off fields of
     mbufs.

m = m_pullup(m0, size);
     After a successful call to m_pullup, the mbuf at the
     head of the returned list, m, is guaranteed to have at
     least size bytes of data in contiguous memory (allowing
     access via a pointer, obtained using the mtod macro).
     If the original data was less than size bytes long, size
     was greater than the size of an mbuf data area (112
     bytes), or required resources were unavailable, m is 0
     and the original mbuf chain is deallocated.

     This routine is particularly useful when verifying
     packet header lengths on reception.  For example, if a
     packet is received and only 8 of the necessary 16 bytes
     required for a valid packet header are present at the
     head of the list of mbufs representing the packet, the
     remaining 8 bytes may be ``pulled up'' with a single
     m_pullup call.  If the call fails the invalid packet
     will have been discarded.
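
     The ``no space is reclaimed'' behaviour of m_adj can be
sketched for the non-negative case as below.  This is an
illustration under simplifying assumptions, not the system's
code: the structure xmbuf is cut down to three fields and the
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down mbuf: chain link, data offset and data length only. */
struct xmbuf {
    struct xmbuf *m_next;
    unsigned long m_off;
    short         m_len;
};

/*
 * Shave diff (>= 0) bytes off the front of a chain.  Emptied mbufs
 * stay on the chain with m_len == 0; nothing is freed.
 */
static void xm_adj_front(struct xmbuf *m, int diff)
{
    assert(diff >= 0);
    while (m != NULL && diff > 0) {
        if (m->m_len <= diff) {
            /* this mbuf is consumed entirely */
            diff -= m->m_len;
            m->m_len = 0;
            m = m->m_next;
        } else {
            /* partial trim: move the data start forward */
            m->m_len = (short)(m->m_len - diff);
            m->m_off += (unsigned long)diff;
            diff = 0;
        }
    }
}

/* Total data bytes remaining on the chain. */
static int xm_length(const struct xmbuf *m)
{
    int n = 0;

    for (; m != NULL; m = m->m_next)
        n += m->m_len;
    return n;
}
```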

     By ensuring mbufs always reside on 128 byte boundaries
it is possible to always locate the mbuf associated with a
data area by masking off the low bits of the virtual
address.  This allows modules to store data structures in
mbufs and pass them around without concern for locating the
original mbuf when it comes time to free the structure.  The
dtom macro is used to convert a pointer into an mbuf's data
area to a pointer to the mbuf,

        #define dtom(x) ((struct mbuf *)((int)x & ~(MSIZE-1)))
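
     The masking trick can be tried outside the kernel with
the sketch below.  It assumes a hypothetical pool (xpool) from
which 128-byte aligned slots are carved, and uses uintptr_t in
place of the int cast so that the arithmetic is valid for any
pointer width.  The names xpool, xpool_mbuf and xdtom are
introduced for this example only.

```c
#include <assert.h>
#include <stdint.h>

#define XMSIZE 128                  /* mbuf size; must be a power of 2 */

/* Raw storage with slack so three XMSIZE-aligned slots always fit. */
static unsigned char xpool[4 * XMSIZE];

/* Return slot i (0..2); each slot starts on an XMSIZE boundary. */
static unsigned char *xpool_mbuf(int i)
{
    uintptr_t base =
        ((uintptr_t)xpool + XMSIZE - 1) & ~(uintptr_t)(XMSIZE - 1);

    assert(i >= 0 && i < 3);
    return (unsigned char *)(base + (uintptr_t)i * XMSIZE);
}

/* Same idea as dtom: clear the low bits to find the owning slot. */
#define xdtom(p) \
    ((unsigned char *)((uintptr_t)(p) & ~(uintptr_t)(XMSIZE - 1)))
```

Any address inside a slot maps back to the start of that slot,
which is why a module holding only a pointer to its own data
structure can still free the mbuf that contains it.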


     Mbufs are used for dynamically allocated data
structures such as sockets, as well as memory allocated for
packets.  Statistics are maintained on mbuf usage and can be
viewed by users using the netstat(1) program.


_6.  _I_n_t_e_r_n_a_l _l_a_y_e_r_i_n_g

     The internal structure of the network system is divided
into  three layers.  These layers correspond to the services
provided by the socket abstraction, those  provided  by  the
communication  protocols, and those provided by the hardware
interfaces.  The communication protocols are  normally  lay-
ered  into two or more individual cooperating layers, though
they are collectively viewed in the system as one layer pro-
viding   services   supportive  of  the  appropriate  socket
abstraction.

     The following sections describe the properties of  each
layer in the system and the interfaces each must conform to.

_6._1.  _S_o_c_k_e_t _l_a_y_e_r

     The socket layer deals with the interprocess
communications facilities provided by the system.  A socket is a
bidirectional endpoint of communication which is ``typed''
by the semantics of communication it supports.  The system
calls described in the 4.2BSD System Manual are used to
manipulate sockets.

     A socket consists of the following data structure:

        struct socket {
               short     so_type;             /* generic type */
               short     so_options;          /* from socket call */
               short     so_linger;           /* time to linger while closing */
               short     so_state;            /* internal state flags */
               caddr_t   so_pcb;              /* protocol control block */
               struct    protosw *so_proto;   /* protocol handle */
               struct    socket *so_head;     /* back pointer to accept socket */
               struct    socket *so_q0;       /* queue of partial connections */
               short     so_q0len;            /* partials on so_q0 */
               struct    socket *so_q;        /* queue of incoming connections */
               short     so_qlen;             /* number of connections on so_q */
               short     so_qlimit;           /* max number queued connections */
               struct    sockbuf so_snd;      /* send queue */
               struct    sockbuf so_rcv;      /* receive queue */
               short     so_timeo;            /* connection timeout */
               u_short   so_error;            /* error affecting connection */
               short     so_oobmark;          /* chars to oob mark */
               short     so_pgrp;             /* pgrp for signals */
        };


     Each socket contains two data queues, so_rcv and
so_snd, and a pointer to routines which provide supporting
services.  The type of the socket, so_type, is defined at
socket creation time and used in selecting those services
which are appropriate to support it.  The supporting
protocol is selected at socket creation time and recorded in the
socket data structure for later use.  Protocols are defined
by a table of procedures, the protosw structure, which will
be described in detail later.  A pointer to a protocol
specific data structure, the ``protocol control block'', is
also present in the socket structure.  Protocols control
this data structure and it normally includes a back pointer
to the parent socket structure(s) to allow easy lookup when
returning information to a user (for example, placing an
error number in the so_error field).  The other entries in
the socket structure are used in queueing connection
requests, validating user requests, storing socket
characteristics (e.g. options supplied at the time a socket is
created), and maintaining a socket's state.

     Processes ``rendezvous at a socket'' in many instances.
For instance, when a process wishes to extract data from a
socket's receive queue and it is empty, or lacks sufficient
data to satisfy the request, the process blocks, supplying
the address of the receive queue as a ``wait channel'' to be
used in notification.  When data arrives for the process and
is placed in the socket's queue, the blocked process is
identified by the fact it is waiting ``on the queue''.

_6._1._1.  _S_o_c_k_e_t _s_t_a_t_e

     A socket's state is defined from the following:

        #define SS_NOFDREF       0x001  /* no file table ref any more */
        #define SS_ISCONNECTED   0x002  /* socket connected to a peer */
        #define SS_ISCONNECTING  0x004  /* in process of connecting to peer */
        #define SS_ISDISCONNECTING 0x008 /* in process of disconnecting */
        #define SS_CANTSENDMORE  0x010  /* can't send more data to peer */
        #define SS_CANTRCVMORE   0x020  /* can't receive more data from peer */
        #define SS_CONNAWAITING  0x040  /* connections awaiting acceptance */
        #define SS_RCVATMARK     0x080  /* at mark on input */

        #define SS_PRIV          0x100  /* privileged */
        #define SS_NBIO          0x200  /* non-blocking ops */
        #define SS_ASYNC         0x400  /* async i/o notify */


     The state of a socket is manipulated both by the proto-
cols  and the user (through system calls).  When a socket is
created  the  state  is  defined  based  on  the   type   of
input/output  the  user wishes to perform.  ``Non-blocking''
I/O  implies a process should  never  be  blocked  to  await
resources.   Instead,  any  call  which  would block returns
prematurely with the error EWOULDBLOCK (the service  request
may  be  partially  fulfilled,  e.g. a request for more data
than is present).

     If a process requested ``asynchronous'' notification of
events related to the socket, the SIGIO signal is posted to
the process when such an event occurs.  An event is a change
in the socket's state; examples of such occurrences are: space
becoming available in the send queue, new data available in
the receive queue, connection establishment or
disestablishment, etc.

     A socket may be marked ``privileged'' if it was
created by the super-user.  Only privileged sockets may
send broadcast packets, or bind addresses in privileged
portions of an address space.

_6._1._2.  _S_o_c_k_e_t _d_a_t_a _q_u_e_u_e_s

     A socket's data queue contains a pointer  to  the  data
stored in the queue and other entries related to the manage-
ment of the data.  The following structure  defines  a  data
queue:

        struct sockbuf {
               short     sb_cc;               /* actual chars in buffer */
               short     sb_hiwat;            /* max actual char count */
               short     sb_mbcnt;            /* chars of mbufs used */
               short     sb_mbmax;            /* max chars of mbufs to use */
               short     sb_lowat;            /* low water mark */
               short     sb_timeo;            /* timeout */
               struct    mbuf *sb_mb;         /* the mbuf chain */
               struct    proc *sb_sel;        /* process selecting read/write */
               short     sb_flags;            /* flags, see below */
        };


     Data is stored in a queue as a  chain  of  mbufs.   The
actual  count  of  characters  as well as high and low water
marks are used by the protocols in controlling the  flow  of
data.   The  socket  routines  cooperate in implementing the
flow control policy by blocking a process when  it  requests
to  send  data  and the high water mark has been reached, or
when it requests to receive data and less than the low water
mark  is  present  (assuming  non-blocking  I/O has not been
specified).
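
     The blocking policy just described can be summarized in
a short sketch.  It is an illustration only: the structure is
reduced to the counters involved, and the names xsockbuf,
xsbspace, xsend_check, xrecv_check and the xaction values are
hypothetical.  The non-blocking outcome corresponds to the
EWOULDBLOCK error mentioned earlier.

```c
#include <assert.h>

/* Counters from the sockbuf structure that drive flow control. */
struct xsockbuf {
    short sb_cc;        /* actual chars in buffer */
    short sb_hiwat;     /* max actual char count */
    short sb_mbcnt;     /* chars of mbufs used */
    short sb_mbmax;     /* max chars of mbufs to use */
    short sb_lowat;     /* low water mark */
};

enum xaction { X_PROCEED, X_BLOCK, X_WOULDBLOCK };

/* Room left before either the character or the mbuf limit is hit. */
static int xsbspace(const struct xsockbuf *sb)
{
    int bychars = sb->sb_hiwat - sb->sb_cc;
    int bymbufs = sb->sb_mbmax - sb->sb_mbcnt;
    int space = bychars < bymbufs ? bychars : bymbufs;

    return space > 0 ? space : 0;
}

/* Sender: block once the high water mark has been reached. */
static enum xaction xsend_check(const struct xsockbuf *sb, int nonblocking)
{
    assert(sb != 0);
    if (xsbspace(sb) > 0)
        return X_PROCEED;
    return nonblocking ? X_WOULDBLOCK : X_BLOCK;
}

/* Receiver: block while less than the low water mark is present. */
static enum xaction xrecv_check(const struct xsockbuf *sb, int nonblocking)
{
    assert(sb != 0);
    if (sb->sb_cc >= sb->sb_lowat)
        return X_PROCEED;
    return nonblocking ? X_WOULDBLOCK : X_BLOCK;
}
```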

     When a socket is created, the supporting protocol
``reserves'' space for the send and receive queues of the
socket.  The actual storage associated with a socket queue
may fluctuate during a socket's lifetime, but it is assumed
this reservation will always allow a protocol to acquire
enough memory to satisfy the high water marks.

     The timeout and select values are  manipulated  by  the
socket  routines  in  implementing  various  portions of the
interprocess  communications  facilities  and  will  not  be
described here.

     A socket queue has a number of flags used in
synchronizing access to the data and in acquiring resources:


        #define SB_LOCK           0x01   /* lock on data queue (so_rcv only) */
        #define SB_WANT           0x02   /* someone is waiting to lock */
        #define SB_WAIT           0x04   /* someone is waiting for data/space */
        #define SB_SEL            0x08   /* buffer is selected */
        #define SB_COLL           0x10   /* collision selecting */

The last two flags are manipulated by the system  in  imple-
menting the select mechanism.

_6._1._3.  _S_o_c_k_e_t _c_o_n_n_e_c_t_i_o_n _q_u_e_u_e_i_n_g

     In dealing with connection oriented sockets (e.g.
SOCK_STREAM) the two sides are considered distinct.  One
side is termed active, and generates connection requests.
The other side is called passive and accepts connection
requests.

     From the passive side, a socket is created with the
option SO_ACCEPTCONN specified, creating two queues of
sockets: so_q0 for connections in progress and so_q for
connections already made and awaiting user acceptance.  As a
protocol is preparing incoming connections, it creates a socket
structure queued on so_q0 by calling the routine
sonewconn().  When the connection is established, the socket
structure is then transferred to so_q, making it available
for an accept.
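
     The movement of a socket from so_q0 to so_q can be
sketched as follows.  This is a simplified model, not the
system's code: the queues are assumed to be null-terminated
lists threaded through the so_q0 and so_q fields, only the
queueing fields of the socket are kept, and the names xsocket,
xq_insert, xq_remove and xconn_established are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Queueing fields of the socket structure only. */
struct xsocket {
    struct xsocket *so_head;    /* back pointer to accept socket */
    struct xsocket *so_q0;      /* queue of partial connections */
    short           so_q0len;   /* partials on so_q0 */
    struct xsocket *so_q;       /* queue of incoming connections */
    short           so_qlen;    /* number of connections on so_q */
};

/* Append so to head's so_q0 (q == 0) or so_q (q != 0). */
static void xq_insert(struct xsocket *head, struct xsocket *so, int q)
{
    struct xsocket **pp = q ? &head->so_q : &head->so_q0;

    assert(head != NULL && so != NULL);
    while (*pp != NULL)
        pp = q ? &(*pp)->so_q : &(*pp)->so_q0;
    *pp = so;
    if (q) {
        so->so_q = NULL;
        head->so_qlen++;
    } else {
        so->so_q0 = NULL;
        head->so_q0len++;
    }
    so->so_head = head;
}

/* Unlink so from the chosen queue; returns 1 if it was found. */
static int xq_remove(struct xsocket *head, struct xsocket *so, int q)
{
    struct xsocket **pp = q ? &head->so_q : &head->so_q0;

    while (*pp != NULL && *pp != so)
        pp = q ? &(*pp)->so_q : &(*pp)->so_q0;
    if (*pp == NULL)
        return 0;
    *pp = q ? so->so_q : so->so_q0;
    if (q) {
        so->so_q = NULL;
        head->so_qlen--;
    } else {
        so->so_q0 = NULL;
        head->so_q0len--;
    }
    so->so_head = NULL;
    return 1;
}

/* Connection complete: move from so_q0 to so_q, ready for accept. */
static int xconn_established(struct xsocket *so)
{
    struct xsocket *head = so->so_head;

    if (head == NULL || !xq_remove(head, so, 0))
        return 0;
    xq_insert(head, so, 1);
    return 1;
}
```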

     If an SO_ACCEPTCONN socket is closed with sockets on
either so_q0 or so_q, these sockets are dropped.

_6._2.  _P_r_o_t_o_c_o_l _l_a_y_e_r(_s)

     Protocols are described by a set of  entry  points  and
certain  socket  visible  characteristics, some of which are
used in deciding which socket type(s) they may support.

     An entry in the ``protocol switch''  table  exists  for
each protocol module configured into the system.  It has the
following form:


        struct protosw {
               short     pr_type;             /* socket type used for */
               short     pr_family;           /* protocol family */
               short     pr_protocol;         /* protocol number */
               short     pr_flags;            /* socket visible attributes */
        /* protocol-protocol hooks */
               int       (*pr_input)();       /* input to protocol (from below) */
               int       (*pr_output)();      /* output to protocol (from above) */
               int       (*pr_ctlinput)();    /* control input (from below) */
               int       (*pr_ctloutput)();   /* control output (from above) */
        /* user-protocol hook */
               int       (*pr_usrreq)();      /* user request */
        /* utility hooks */
               int       (*pr_init)();        /* initialization routine */
               int       (*pr_fasttimo)();    /* fast timeout (200ms) */
               int       (*pr_slowtimo)();    /* slow timeout (500ms) */
               int       (*pr_drain)();       /* flush any excess space possible */
        };


     A protocol is called through the pr_init entry before
any other.  Thereafter it is called every 200 milliseconds
through the pr_fasttimo entry and every 500 milliseconds
through the pr_slowtimo entry for timer based actions.  The
system will call the pr_drain entry if it is low on space;
this routine should throw away any non-critical data.
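
     The calling sequence for the utility hooks can be
sketched with a small table-driven dispatcher.  This is an
illustration under stated assumptions, not the system's code:
the table holds only three of the hooks, the demo protocol
routines merely count their invocations, and the names
xprotosw_tab, xproto_init and xproto_clock are hypothetical.
A null entry is taken to mean the protocol does not need that
service.

```c
#include <assert.h>
#include <stddef.h>

/* Utility hooks only; the other protosw fields are omitted. */
struct xprotosw {
    void (*pr_init)(void);      /* initialization routine */
    void (*pr_fasttimo)(void);  /* fast timeout (200ms) */
    void (*pr_slowtimo)(void);  /* slow timeout (500ms) */
};

/* Demo protocol routines: they only count how often they run. */
static int xinit_calls, xfast_calls, xslow_calls;
static void xdemo_init(void) { xinit_calls++; }
static void xdemo_fast(void) { xfast_calls++; }
static void xdemo_slow(void) { xslow_calls++; }

static struct xprotosw xprotosw_tab[] = {
    { xdemo_init, xdemo_fast, xdemo_slow },
    { xdemo_init, NULL, NULL },         /* protocol without timers */
};
#define XNPROTO (sizeof(xprotosw_tab) / sizeof(xprotosw_tab[0]))

static int xfast_left = 200, xslow_left = 500;  /* ms until next call */

/* Call every pr_init entry once, before any other entry is used. */
static void xproto_init(void)
{
    size_t i;

    for (i = 0; i < XNPROTO; i++)
        if (xprotosw_tab[i].pr_init != NULL)
            (*xprotosw_tab[i].pr_init)();
}

/* Advance the clock by ms and run whichever timeouts have expired. */
static void xproto_clock(int ms)
{
    size_t i;

    assert(ms >= 0);
    xfast_left -= ms;
    xslow_left -= ms;
    while (xfast_left <= 0) {
        for (i = 0; i < XNPROTO; i++)
            if (xprotosw_tab[i].pr_fasttimo != NULL)
                (*xprotosw_tab[i].pr_fasttimo)();
        xfast_left += 200;
    }
    while (xslow_left <= 0) {
        for (i = 0; i < XNPROTO; i++)
            if (xprotosw_tab[i].pr_slowtimo != NULL)
                (*xprotosw_tab[i].pr_slowtimo)();
        xslow_left += 500;
    }
}
```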

     Protocols pass data between themselves as chains of
mbufs using the pr_input and pr_output routines.  Pr_input
passes data up (towards the user) and pr_output passes it
down (towards the network); control information passes up
and down on pr_ctlinput and pr_ctloutput.  The protocol is
responsible for the space occupied by any of the arguments to
these entries and must dispose of it.

     The _p_r__u_s_e_r_r_e_q  routine  interfaces  protocols  to  the
socket code and is described below.

     The _p_r__f_l_a_g_s field is constructed  from  the  following
values:

        #define PR_ATOMIC         0x01   /* exchange atomic messages only */
        #define PR_ADDR           0x02   /* addresses given with messages */
        #define PR_CONNREQUIRED   0x04   /* connection required by protocol */
        #define PR_WANTRCVD       0x08   /* want PRU_RCVD calls */
        #define PR_RIGHTS         0x10   /* passes capabilities */

Protocols which are connection-based specify the
PR_CONNREQUIRED flag so that the socket routines will never
attempt to send data before a connection has been
established.  If the PR_WANTRCVD flag is set, the socket
routines will notify the protocol when the user has removed
data from the socket's receive queue.  This allows the
protocol to implement acknowledgement on user receipt, and
also update windowing information based on the amount of
space available in the receive queue.  The PR_ADDR flag
indicates any data placed in the socket's receive queue will
be preceded by the address of the sender.  The PR_ATOMIC
flag specifies each user request to send data must be
performed in a single protocol send request; it is the
protocol's responsibility to maintain record boundaries on
data to be sent.  The PR_RIGHTS flag indicates the protocol
supports the passing of capabilities; this is currently
used only by the protocols in the UNIX protocol family.

     When a socket is created, the socket routines scan the
protocol table looking for an appropriate protocol to
support the type of socket being created.  The pr_type field
contains one of the possible socket types (e.g.
SOCK_STREAM), while the pr_family field indicates which
protocol family the protocol belongs to.  The pr_protocol field
contains the protocol number of the protocol, normally a
well known value.
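
     The table scan can be sketched as a simple linear
search.  This is an illustration rather than the system's
lookup routine: the table contents, the X-prefixed constants
and the function name xpffind are assumptions made for the
example, and a protocol argument of 0 is taken here to mean
``any protocol of this type''.

```c
#include <assert.h>
#include <stddef.h>

/* Identification fields of the protocol switch entry. */
struct xprotosw {
    short pr_type;      /* socket type used for */
    short pr_family;    /* protocol family */
    short pr_protocol;  /* protocol number */
    short pr_flags;     /* socket visible attributes */
};

#define XSOCK_STREAM      1     /* illustrative socket type values */
#define XSOCK_DGRAM       2
#define XPF_INET          2     /* illustrative family value */

#define XPR_ATOMIC        0x01  /* flag values as given in the text */
#define XPR_ADDR          0x02
#define XPR_CONNREQUIRED  0x04
#define XPR_WANTRCVD      0x08

/* A two-entry example table: a datagram and a stream protocol. */
static const struct xprotosw xprotosw_tab[] = {
    { XSOCK_DGRAM,  XPF_INET, 17, XPR_ATOMIC | XPR_ADDR },
    { XSOCK_STREAM, XPF_INET, 6,  XPR_CONNREQUIRED | XPR_WANTRCVD },
};
#define XNPROTO (sizeof(xprotosw_tab) / sizeof(xprotosw_tab[0]))

/* Return the first entry matching family and type, or NULL. */
static const struct xprotosw *xpffind(int family, int type, int protocol)
{
    size_t i;

    assert(protocol >= 0);
    for (i = 0; i < XNPROTO; i++) {
        const struct xprotosw *pr = &xprotosw_tab[i];

        if (pr->pr_family != family || pr->pr_type != type)
            continue;
        if (protocol == 0 || pr->pr_protocol == protocol)
            return pr;
    }
    return NULL;
}
```

The entry returned would then be recorded in the socket's
so_proto field for later use.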

_6._3.  _N_e_t_w_o_r_k-_i_n_t_e_r_f_a_c_e _l_a_y_e_r

     Each network-interface configured into a system defines
a path through which packets may be sent and received.
Normally a hardware device is associated with this interface,
though there is no requirement for this (for example, all
systems have a software ``loopback'' interface used for
debugging and performance analysis).  In addition to
manipulating the hardware device, an interface module is
responsible for encapsulation and decapsulation of any low level
header information required to deliver a message to its
destination.  The selection of which interface to use in
delivering packets is a routing decision carried out at a
higher level than the network-interface layer.  Each
interface normally identifies itself at boot time to the routing
module so that it may be selected for packet delivery.

     An interface is defined by the following structure,


        struct ifnet {
               char      *if_name;            /* name, e.g. ``en'' or ``lo'' */
               short     if_unit;             /* sub-unit for lower level driver */
               short     if_mtu;              /* maximum transmission unit */
               int       if_net;              /* network number of interface */
               short     if_flags;            /* up/down, broadcast, etc. */
               short     if_timer;            /* time 'til if_watchdog called */
               int       if_host[2];          /* local net host number */
               struct    sockaddr if_addr;    /* address of interface */
               union {
                         struct    sockaddr ifu_broadaddr;
                         struct    sockaddr ifu_dstaddr;
               } if_ifu;
               struct    ifqueue if_snd;      /* output queue */
               int       (*if_init)();        /* init routine */
               int       (*if_output)();      /* output routine */
               int       (*if_ioctl)();       /* ioctl routine */
               int       (*if_reset)();       /* bus reset routine */
               int       (*if_watchdog)();    /* timer routine */
               int       if_ipackets;         /* packets received on interface */
               int       if_ierrors;          /* input errors on interface */
               int       if_opackets;         /* packets sent on interface */
               int       if_oerrors;          /* output errors on interface */
               int       if_collisions;       /* collisions on csma interfaces */
               struct    ifnet *if_next;
        };


     Each interface has a send queue and routines  used  for
initialization,  _i_f__i_n_i_t,  and  output,  _i_f__o_u_t_p_u_t.   If the
interface resides on a system bus, the routine _i_f__r_e_s_e_t will
be called after a bus reset has been performed. An interface
may also specify a timer routine, _i_f__w_a_t_c_h_d_o_g, which  should
be called every _i_f__t_i_m_e_r seconds (if non-zero).
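
     The timer convention can be sketched as follows.  This
is an illustration only, not the system's code: the trimmed
structure toy_ifnet, the routine toy_if_slowtimo, the counting
routine toy_watchdog, and the assumption that the sweep runs
once per second are all hypothetical.  Only the field names
are taken from the structure above.  The sketch treats the
timer as one-shot, so a watchdog routine that wants to run
again must reset if_timer itself.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down interface record; only timer fields kept. */
struct toy_ifnet {
    short if_unit;                  /* sub-unit for the driver */
    short if_timer;                 /* seconds until watchdog; 0 = off */
    int (*if_watchdog)(int unit);   /* timer routine, may be NULL */
    struct toy_ifnet *if_next;      /* next interface on the list */
};

/* Example watchdog routine that merely counts its invocations. */
int toy_watchdog_calls = 0;

int toy_watchdog(int unit)
{
    (void)unit;
    toy_watchdog_calls++;
    return 0;
}

/*
 * Assumed to be called once per second.  Counts each armed timer
 * down and calls the watchdog routine when the count reaches zero.
 */
void toy_if_slowtimo(struct toy_ifnet *list)
{
    struct toy_ifnet *ifp;

    for (ifp = list; ifp != NULL; ifp = ifp->if_next) {
        if (ifp->if_timer == 0 || --ifp->if_timer > 0)
            continue;               /* timer off, or still running */
        if (ifp->if_watchdog != NULL)
            (*ifp->if_watchdog)(ifp->if_unit);
    }
}
```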

     The state of an interface and  certain  characteristics
are  stored in the _i_f__f_l_a_g_s field.  The following values are
possible:

        #define IFF_UP            0x1    /* interface is up */
        #define IFF_BROADCAST     0x2    /* broadcast address valid */
        #define IFF_DEBUG         0x4    /* turn on debugging */
        #define IFF_ROUTE         0x8    /* routing entry installed */
        #define IFF_POINTOPOINT   0x10   /* interface is point-to-point link */
        #define IFF_NOTRAILERS    0x20   /* avoid use of trailers */
        #define IFF_RUNNING       0x40   /* resources allocated */
        #define IFF_NOARP         0x80   /* no address resolution protocol */

If the interface is connected to a  network  which  supports
transmission  of  _b_r_o_a_d_c_a_s_t  packets, the IFF_BROADCAST flag
will be set and the  _i_f__b_r_o_a_d_a_d_d_r  field  will  contain  the
address  to  be  used  in  sending  or accepting a broadcast
packet.  If the interface is  associated  with  a  point  to
point  hardware  link  (for  example,  a  DEC  DMR-11),  the
IFF_POINTOPOINT flag will be set and _i_f__d_s_t_a_d_d_r will contain
the address of the host on the other side of the connection.
These addresses and the  local  address  of  the  interface,
_i_f__a_d_d_r, are used in filtering incoming packets.  The inter-
face  sets  IFF_RUNNING  after  it  has   allocated   system
resources  and  posted  an  initial  read  on  the device it
manages.  This state bit is used to avoid  multiple  alloca-
tion  requests  when an interface's address is changed.  The
IFF_NOTRAILERS flag indicates the interface  should  refrain
from  using  a  _t_r_a_i_l_e_r  encapsulation  on outgoing packets;
_t_r_a_i_l_e_r  protocols  are  described  in  section   14.    The
IFF_NOARP  flag  indicates  the  interface should not use an
``address  resolution  protocol''  in  mapping  internetwork
addresses to local network addresses.
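
     A minimal sketch of how the flags might guard the use of
the broadcast address when filtering input is given below.
The type toy_addr (standing in for a struct sockaddr), the
structure toy_ifnet, and the routine toy_if_accepts are
hypothetical, as is the decision to reject everything while
the interface is down.  The flag values are those listed
above.

```c
#define IFF_UP          0x1     /* interface is up */
#define IFF_BROADCAST   0x2     /* broadcast address valid */
#define IFF_POINTOPOINT 0x10    /* interface is point-to-point link */

typedef unsigned long toy_addr; /* hypothetical stand-in for struct sockaddr */

struct toy_ifnet {
    short    if_flags;
    toy_addr if_addr;               /* local address of interface */
    union {
        toy_addr ifu_broadaddr;     /* valid when IFF_BROADCAST is set */
        toy_addr ifu_dstaddr;       /* valid when IFF_POINTOPOINT is set */
    } if_ifu;
};
#define if_broadaddr if_ifu.ifu_broadaddr
#define if_dstaddr   if_ifu.ifu_dstaddr

/* Return 1 if a packet addressed to dst should be accepted, else 0. */
int toy_if_accepts(const struct toy_ifnet *ifp, toy_addr dst)
{
    if ((ifp->if_flags & IFF_UP) == 0)
        return 0;                   /* assumption: a down interface accepts nothing */
    if (dst == ifp->if_addr)
        return 1;
    /* The union member is only meaningful if the flag says so. */
    if ((ifp->if_flags & IFF_BROADCAST) && dst == ifp->if_broadaddr)
        return 1;
    return 0;
}
```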

     The information stored in an _i_f_n_e_t structure for  point
to  point communication devices is not currently used by the
system internally.  Rather, it is used  by  the  user  level
routing  process in determining host network connections and
in initially devising routes (refer to chapter 10  for  more
information).

     Various statistics are also  stored  in  the  interface
structure.    These   may  be  viewed  by  users  using  the
_n_e_t_s_t_a_t(1) program.

     The interface address and flags may  be  set  with  the
SIOCSIFADDR and SIOCSIFFLAGS ioctls.  SIOCSIFADDR is used to
initially define each interface's address; SIOCSIFFLAGS  can
be  used to mark an interface down and perform site-specific
configuration.

_6._3._1.  _U_N_I_B_U_S _i_n_t_e_r_f_a_c_e_s

     All hardware related interfaces currently reside on the
UNIBUS.   Consequently  a common set of utility routines for
dealing with the UNIBUS has  been  developed.   Each  UNIBUS
interface utilizes a structure of the following form:


        struct ifuba {
               short     ifu_uban;            /* uba number */
               short     ifu_hlen;            /* local net header length */
               struct    uba_regs *ifu_uba;   /* uba regs, in vm */
               struct ifrw {
                         caddr_t   ifrw_addr; /* virt addr of header */
                         int       ifrw_bdp;  /* unibus bdp */
                         int       ifrw_info; /* value from ubaalloc */
                         int       ifrw_proto;/* map register prototype */
                         struct    pte *ifrw_mr;/* base of map registers */
               } ifu_r, ifu_w;
               struct    pte ifu_wmap[IF_MAXNUBAMR];/* base pages for output */
               short     ifu_xswapd;          /* mask of clusters swapped */
               short     ifu_flags;           /* used during uballoc's */
               struct    mbuf *ifu_xtofree;   /* pages being dma'd out */
        };


     The _i_f__u_b_a structure describes UNIBUS resources held by
an interface.  IF_NUBAMR map registers are held for datagram
data, starting at _i_f_r__m_r.  UNIBUS  map  register  _i_f_r__m_r[-1]
maps  the  local  network  header ending on a page boundary.
UNIBUS data paths are reserved for read and for write, given
by _i_f_r__b_d_p.  The prototype of the map registers for read and
for write is saved in _i_f_r__p_r_o_t_o.

     When write transfers are not full pages on  page  boun-
daries  the data is just copied into the pages mapped on the
UNIBUS and the transfer is started.  If a write transfer  is
of  a  (1024  byte) page size and on a page boundary, UNIBUS
page table entries are swapped to reference the  pages,  and
then  the  initial pages are remapped from _i_f_u__w_m_a_p when the
transfer completes.
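
     The choice between copying and page swapping on output
reduces to a test of size and alignment.  The following is a
sketch of that test alone; the names TOY_PAGE_SIZE, toy_xfer,
and toy_write_strategy are hypothetical, and the remapping
itself is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 1024u     /* the 1024 byte page size cited in the text */

enum toy_xfer { TOY_COPY, TOY_SWAP };

/*
 * Decide how a write transfer of len bytes starting at addr would be
 * handled: swap page table entries when the data is exactly one page
 * on a page boundary, otherwise copy into the premapped pages.
 */
enum toy_xfer toy_write_strategy(uintptr_t addr, size_t len)
{
    if (len == TOY_PAGE_SIZE && (addr % TOY_PAGE_SIZE) == 0)
        return TOY_SWAP;
    return TOY_COPY;
}
```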

     When read transfers give whole  pages  of  data  to  be
input,  page  frames  are allocated from a network page list
and traded with the pages already containing the data,  map-
ping  the allocated pages to replace the input pages for the
next UNIBUS data input.

     The following utility routines are available for use in
writing network interface drivers; all use the _i_f_u_b_a structure
described above.

if_ubainit(ifu, uban, hlen, nmr);
     _i_f__u_b_a_i_n_i_t allocates resources on UNIBUS  adaptor  _u_b_a_n
     and  stores  the  resultant  information  in  the _i_f_u_b_a
     structure pointed to by _i_f_u. It is called only at  boot
     time  or after a UNIBUS reset. Two data paths (buffered
     or unbuffered, depending on the  _i_f_u__f_l_a_g_s  field)  are
     allocated,  one  for  reading and one for writing.  The
     _n_m_r parameter indicates the number  of  UNIBUS  mapping
     registers  required  to map a maximal sized packet onto
     the UNIBUS, while _h_l_e_n specifies the size  of  a  local
     network   header,   if  any,  which  should  be  mapped
     separately  from  the  data  (see  the  description  of
     trailer  protocols  in  chapter 14).  Sufficient UNIBUS
     mapping registers and pages of memory are allocated  to
     initialize  the  input  data  path for an initial read.
     For the output data path, mapping registers  and  pages
     of  memory  are  also  allocated  and  mapped  onto the
     UNIBUS.  The pages associated with the output data path
     are held in reserve in the event a write requires copy-
     ing non-page-aligned data (see _i_f__w_u_b_a_p_u_t  below).   If
     _i_f__u_b_a_i_n_i_t  is called with resources already allocated,
     they will be used instead of allocating new ones  (this
     normally occurs after a UNIBUS reset).  A 1 is returned
     when allocation and initialization  are  successful,  0
     otherwise.

m = if_rubaget(ifu, totlen, off0);
     _i_f__r_u_b_a_g_e_t pulls read data off  an  interface.   _t_o_t_l_e_n
     specifies the length of data to be obtained, not count-
     ing the local network header.  If _o_f_f_0 is non-zero,  it
     indicates  a  byte  offset  to a trailing local network
     header which should be copied into a separate mbuf  and
     prepended  to  the  front  of the resultant mbuf chain.
     When page sized units  of  data  are  present  and  are
     page-aligned,  the  previously  mapped  data  pages are
     remapped into the mbufs and swapped with  fresh  pages,
     thus  avoiding any copying.  A 0 return value indicates
     a failure to allocate resources.

if_wubaput(ifu, m);
     _i_f__w_u_b_a_p_u_t maps a chain of mbufs onto a network  inter-
     face in preparation for output.  The chain includes any
     local network  header,  which  is  copied  so  that  it
     resides in the mapped and aligned I/O space.  Any other
     mbufs which contained non page sized data portions  are
     also copied to the I/O space.  Pages mapped from a pre-
     vious output operation (no longer needed) are  unmapped
     and returned to the network page pool.
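
     The effect of the off0 parameter to if_rubaget can be
illustrated on a flat buffer.  This is a sketch under strong
simplifying assumptions: the routine toy_untrail is
hypothetical, it works on byte arrays rather than mbuf chains,
and it always copies rather than remapping pages.  It shows
only the reordering, in which a header found at the tail of
the received data is moved to the front.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a received frame of totlen bytes from in to out.  If off0 is
 * non-zero, bytes [off0, totlen) are a trailing header and are placed
 * first, followed by the data bytes [0, off0).  Returns the number of
 * bytes written, or 0 if off0 lies beyond the end of the frame.
 */
size_t toy_untrail(const unsigned char *in, size_t totlen, size_t off0,
                   unsigned char *out)
{
    size_t hdrlen;

    assert(in != NULL && out != NULL);
    if (off0 > totlen)
        return 0;
    if (off0 == 0) {                    /* no trailing header */
        memcpy(out, in, totlen);
        return totlen;
    }
    hdrlen = totlen - off0;
    memcpy(out, in + off0, hdrlen);     /* header goes to the front */
    memcpy(out + hdrlen, in, off0);     /* data follows it */
    return totlen;
}
```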


_7.  _S_o_c_k_e_t/_p_r_o_t_o_c_o_l _i_n_t_e_r_f_a_c_e

     The interface between the socket routines and the  com-
munication   protocols  is  through  the  _p_r__u_s_r_r_e_q  routine
defined  in  the  protocol  switch  table.   The   following
requests to a protocol module are possible:

        #define PRU_ATTACH        0      /* attach protocol */
        #define PRU_DETACH        1      /* detach protocol */
        #define PRU_BIND          2      /* bind socket to address */
        #define PRU_LISTEN        3      /* listen for connection */
        #define PRU_CONNECT       4      /* establish connection to peer */
        #define PRU_ACCEPT        5      /* accept connection from peer */
        #define PRU_DISCONNECT    6      /* disconnect from peer */
        #define PRU_SHUTDOWN      7      /* won't send any more data */
        #define PRU_RCVD          8      /* have taken data; more room now */
        #define PRU_SEND          9      /* send this data */
        #define PRU_ABORT         10     /* abort (fast DISCONNECT, DETACH) */
        #define PRU_CONTROL       11     /* control operations on protocol */
        #define PRU_SENSE         12     /* return status into m */
        #define PRU_RCVOOB        13     /* retrieve out of band data */
        #define PRU_SENDOOB       14     /* send out of band data */
        #define PRU_SOCKADDR      15     /* fetch socket's address */
        #define PRU_PEERADDR      16     /* fetch peer's address */
        #define PRU_CONNECT2      17     /* connect two sockets */
        /* begin for protocols internal use */
        #define PRU_FASTTIMO      18     /* 200ms timeout */
        #define PRU_SLOWTIMO      19     /* 500ms timeout */
        #define PRU_PROTORCV      20     /* receive from below */
        #define PRU_PROTOSEND     21     /* send to below */

A call on the user request routine is of the form,

        error = (*protosw[].pr_usrreq)(up, req, m, addr, rights);
        int error; struct socket *up; int req; struct mbuf *m, *rights; caddr_t addr;

The mbuf chain, _m, and the address are optional  parameters.
The _r_i_g_h_t_s parameter is an optional pointer to an mbuf chain
containing user specified capabilities (see the _s_e_n_d_m_s_g  and
_r_e_c_v_m_s_g  system  calls).   The  protocol  is responsible for
disposal of both mbuf chains.  A non-zero return value gives
a  UNIX  error number which should be passed to higher level
software.  The following paragraphs  describe  each  of  the
requests possible.

PRU_ATTACH
     When a protocol is bound to a socket (with  the  _s_o_c_k_e_t
     system  call)  the  protocol module is called with this
     request.  It is  the  responsibility  of  the  protocol
     module   to  allocate  any  resources  necessary.   The
     ``attach'' request will always precede any of the other
     requests, and should not occur more than once.

PRU_DETACH
     This is the antithesis of the attach  request,  and  is
     used  at  the  time  a socket is deleted.  The protocol
     module may deallocate any  resources  assigned  to  the
     socket.

PRU_BIND
     When a socket is initially created it  has  no  address
     bound  to it.  This request indicates an address should
     be bound to an existing socket.   The  protocol  module
     must  verify  the requested address is valid and avail-
     able for use.

PRU_LISTEN
     The ``listen'' request indicates  the  user  wishes  to
     listen  for incoming connection requests on the associ-
     ated socket.  The protocol module  should  perform  any
     state changes needed to carry out this request (if pos-
     sible).  A ``listen'' request always precedes a request
     to accept a connection.

PRU_CONNECT
     The ``connect'' request indicates the user wants to
     establish an association.  The _a_d_d_r parameter supplied
     describes the peer to be connected to.  The effect of a
     connect  request  may  vary  depending on the protocol.
     Virtual circuit protocols, such as TCP [Postel80b], use
     this request to initiate establishment of a TCP connec-
     tion.  Datagram protocols, such as UDP [Postel79], sim-
     ply  record the peer's address in a private data struc-
     ture and use it to tag all outgoing packets.  There are
     no restrictions on how many times a connect request may
     be used after an attach.  If a  protocol  supports  the
     notion of _m_u_l_t_i-_c_a_s_t_i_n_g, it is possible to use multiple
     connects to establish  a  multi-cast  group.   Alterna-
     tively,   an   association   may   be   broken   by   a
     PRU_DISCONNECT request, and a new  association  created
     with a subsequent connect request; all without destroy-
     ing and creating a new socket.

PRU_ACCEPT
     Following  a  successful  PRU_LISTEN  request  and  the
     arrival  of  one  or  more connections, this request is
     made to indicate the user has accepted the  first  con-
     nection  on the queue of pending connections.  The pro-
     tocol module should fill in the supplied address buffer
     with the address of the connected party.

PRU_DISCONNECT
     Eliminate an association  created  with  a  PRU_CONNECT
     request.

PRU_SHUTDOWN
     This call is used to indicate no more data will be sent
     and/or  received  (the  _a_d_d_r  parameter  indicates  the
     direction of the shutdown, as encoded in the _s_o_s_h_u_t_d_o_w_n
     system  call).   The  protocol  may, at its discretion,
     deallocate any data structures related to the shutdown.

PRU_RCVD
     This request is made only if the protocol entry in  the
     protocol  switch  table  includes the PR_WANTRCVD flag.
     When a user removes data from the  receive  queue  this
     request will be sent to the protocol module.  It may be
     used to  trigger  acknowledgements,  refresh  windowing
     information, initiate data transfer, etc.

PRU_SEND
     Each user request to send data is translated  into  one
     or  more  PRU_SEND  requests (a protocol may indicate a
     single user send request must be translated into a sin-
     gle  PRU_SEND  request by specifying the PR_ATOMIC flag
     in its protocol description).  The data to be  sent  is
     presented  to  the  protocol  as a list of mbufs and an
     address is, optionally, supplied in the _a_d_d_r parameter.
     The  protocol is responsible for preserving the data in
     the socket's send queue if it is not able  to  send  it
     immediately,  or  if  it may need it at some later time
     (e.g. for retransmission).

PRU_ABORT
     This request indicates an abnormal termination of  ser-
     vice.    The   protocol   should  delete  any  existing
     association(s).

PRU_CONTROL
     The ``control'' request is generated when a  user  per-
     forms  a  UNIX  _i_o_c_t_l  system call on a socket (and the
     ioctl is not intercepted by the socket  routines).   It
     allows protocol-specific operations to be provided out-
     side the scope of the  common  socket  interface.   The
     _a_d_d_r  parameter  contains  a pointer to a static kernel
     data area where relevant information may be obtained or
     returned.   The  _m  parameter contains the actual _i_o_c_t_l
     request code (note  the  non-standard  calling  conven-
     tion).

PRU_SENSE
     The ``sense'' request is generated when the user  makes
     an _f_s_t_a_t system call on a socket; it requests status of
     the associated socket. There  currently  is  no  common
     format for the status returned. Information which might
     be returned includes per-connection statistics,  proto-
     col  state,  resources  currently in use by the connec-
     tion, the optimal  transfer  size  for  the  connection
     (based  on  windowing  information  and  maximum packet
     size).  The _a_d_d_r parameter  contains  a  pointer  to  a
     static  kernel data area where the status buffer should
     be placed.


PRU_RCVOOB
     Any ``out-of-band'' data presently available is  to  be
     returned.   An mbuf is passed in to the protocol module
     and the protocol should either place data in  the  mbuf
     or  attach  new  mbufs  to the one supplied if there is
     insufficient space in the single mbuf.

PRU_SENDOOB
     Like PRU_SEND, but for out-of-band data.

PRU_SOCKADDR
     The local address of the socket is returned, if any  is
     currently bound to it.  The address format (protocol
     specific) is returned in the _a_d_d_r parameter.

PRU_PEERADDR
     The address of the peer to which  the  socket  is  con-
     nected   is   returned.    The  socket  must  be  in  a
     SS_ISCONNECTED state for this request to be made to the
     protocol.   The  address  format (protocol specific) is
     returned in the _a_d_d_r parameter.

PRU_CONNECT2
     The  protocol  module  is  supplied  two  sockets   and
     requested  to  establish  a  connection between the two
     without binding any addresses, if possible.  This  call
     is used in implementing the _s_o_c_k_e_t_p_a_i_r system call.
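
     The shape of a user request routine for a simple
datagram style protocol might be sketched as follows.  The
names toy_usrreq, toy_socket, and toy_pcb are hypothetical,
addresses are reduced to integers, the m and rights
parameters are omitted, and the particular error numbers
returned are assumptions chosen for illustration.  The
request codes are those defined above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define PRU_ATTACH      0
#define PRU_DETACH      1
#define PRU_BIND        2
#define PRU_CONNECT     4
#define PRU_DISCONNECT  6

/* Hypothetical per-socket protocol state. */
struct toy_pcb {
    int laddr;                  /* local address, 0 = unbound */
    int faddr;                  /* peer address, 0 = no association */
};

/* Hypothetical stand-in for struct socket. */
struct toy_socket {
    struct toy_pcb *so_pcb;
};

int toy_usrreq(struct toy_socket *so, int req, const int *addr)
{
    struct toy_pcb *pcb;

    assert(so != NULL);
    pcb = so->so_pcb;
    /* Attach always precedes the other requests. */
    if (pcb == NULL && req != PRU_ATTACH)
        return EINVAL;

    switch (req) {
    case PRU_ATTACH:
        if (pcb != NULL)
            return EINVAL;      /* attach should not occur twice */
        pcb = calloc(1, sizeof *pcb);
        if (pcb == NULL)
            return ENOBUFS;
        so->so_pcb = pcb;
        return 0;
    case PRU_DETACH:
        free(pcb);              /* release what attach allocated */
        so->so_pcb = NULL;
        return 0;
    case PRU_BIND:
        if (addr == NULL || *addr == 0 || pcb->laddr != 0)
            return EINVAL;      /* address invalid or already bound */
        pcb->laddr = *addr;
        return 0;
    case PRU_CONNECT:
        if (addr == NULL || *addr == 0)
            return EINVAL;
        pcb->faddr = *addr;     /* simply record the peer; may be repeated */
        return 0;
    case PRU_DISCONNECT:
        if (pcb->faddr == 0)
            return ENOTCONN;
        pcb->faddr = 0;
        return 0;
    default:
        return EOPNOTSUPP;      /* request not handled in this sketch */
    }
}
```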

     The following requests are used internally by the  pro-
tocol  modules  and  are  never generated by the socket rou-
tines.   In  certain  instances,  they  are  handed  to  the
_p_r__u_s_r_r_e_q  routine  solely  for  convenience  in  tracing  a
protocol's operation (e.g. PRU_SLOWTIMO).

PRU_FASTTIMO
     A ``fast timeout'' has occurred.  This request is  made
     when a timeout occurs in the protocol's _p_r__f_a_s_t_t_i_m_o rou-
     tine.   The  _a_d_d_r  parameter  indicates   which   timer
     expired.

PRU_SLOWTIMO
     A ``slow timeout'' has occurred.  This request is  made
     when  a  timeout  occurs  in the protocol's _p_r__s_l_o_w_t_i_m_o
     routine.  The  _a_d_d_r  parameter  indicates  which  timer
     expired.

PRU_PROTORCV
     This request is used in the protocol-protocol interface,
     not by the socket routines.  It requests reception of
     data destined for the protocol and not  the  user.   No
     protocols currently use this facility.

PRU_PROTOSEND
     This request allows a protocol to  send  data  destined
     for  another  protocol module, not a user.  The details
     of how data is marked ``addressed to protocol'' instead
     of  ``addressed  to  user''  are  left  to the protocol
     modules.  No protocols currently use this facility.


_8.  _P_r_o_t_o_c_o_l/_p_r_o_t_o_c_o_l _i_n_t_e_r_f_a_c_e

     The interface between protocol modules is  through  the
_p_r__u_s_r_r_e_q,    _p_r__i_n_p_u_t,    _p_r__o_u_t_p_u_t,    _p_r__c_t_l_i_n_p_u_t,    and
_p_r__c_t_l_o_u_t_p_u_t routines.  The calling conventions for all  but
the  _p_r__u_s_r_r_e_q  routine  are  expected to be specific to the
protocol modules and are not  guaranteed  to  be  consistent
across  protocol  families.  We will examine the conventions
used for some of the Internet protocols in this  section  as
an example.

_8._1.  _p_r__o_u_t_p_u_t

     The Internet protocol UDP uses the convention,

        error = udp_output(inp, m);
        int error; struct inpcb *inp; struct mbuf *m;

where the _i_n_p, ``_i_nternet _protocol _control  _block'',  passed
between  modules  conveys  per connection state information,
and the mbuf chain contains the data to be sent.   UDP  per-
forms  consistency  checks, appends its header, calculates a
checksum, etc. before  passing  the  packet  on  to  the  IP
module:

        error = ip_output(m, opt, ro, allowbroadcast);
        int error; struct mbuf *m, *opt; struct route *ro; int allowbroadcast;


     The call to IP's output  routine  is  more  complicated
than  that  for  UDP,  as  befits the additional work the IP
module must do.  The _m parameter is the data to be sent, and
the  _o_p_t  parameter  is an optional list of IP options which
should be placed in the IP packet header.  The _r_o  parameter
is  used  in  making routing decisions (and passing them
back to the caller).  The final parameter, _a_l_l_o_w_b_r_o_a_d_c_a_s_t, is
a  flag  indicating  if  the  user  is allowed to transmit a
broadcast packet.  This may be inconsequential if the under-
lying hardware does not support the notion of broadcasting.

     All output routines return 0  on  success  and  a  UNIX
error number if a failure occurred which could be immediately
detected (no buffer space available, no route to destination,
etc.).
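
     The checksum step mentioned above for UDP is, per the
UDP specification, the 16-bit one's complement Internet
checksum.  A minimal sketch of that sum over a flat buffer
follows.  The routine toy_in_cksum is hypothetical; it
ignores the pseudo-header that a real UDP checksum also
covers, it does not walk mbuf chains, and it is meant for
packet sized buffers only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's complement of the one's complement sum of the buffer, taken
 * as big-endian 16-bit words.  An odd trailing byte is padded with a
 * zero byte.  The result is returned as a host integer.
 */
uint16_t toy_in_cksum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;

    assert(buf != NULL || len == 0);
    while (len > 1) {
        sum += ((uint32_t)buf[0] << 8) | buf[1];
        buf += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)buf[0] << 8;       /* pad the odd byte */
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```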

_8._2.  _p_r__i_n_p_u_t

     Both UDP and TCP use the following calling convention,

        (void) (*protosw[].pr_input)(m);
        struct mbuf *m;

Each mbuf list passed is a single packet to be processed  by
the protocol module.


     The IP input routine is a VAX software interrupt  level
routine,  and  so  is  not  called  with any parameters.  It
instead  communicates  with  network  interfaces  through  a
queue,  _i_p_i_n_t_r_q,  which  is  identical  in  structure to the
queues used by the network interfaces  for  storing  packets
awaiting transmission.

_8._3.  _p_r__c_t_l_i_n_p_u_t

     This routine is used to convey ``control''  information
to a protocol module (i.e. information which might be passed
to the user, but  is  not  data).   This  routine,  and  the
_p_r__c_t_l_o_u_t_p_u_t  routine,  have not been extensively developed,
and thus suffer from  a  ``clumsiness''  that  can  only  be
improved as more demands are placed on them.

     The common calling convention for this routine is,

        (void) (*protosw[].pr_ctlinput)(req, info);
        int req; caddr_t info;

The _r_e_q parameter is one of the following,

        #define PRC_IFDOWN             0      /* interface transition */
        #define PRC_ROUTEDEAD          1      /* select new route if possible */
        #define PRC_QUENCH             4      /* someone said to slow down */
        #define PRC_HOSTDEAD           6      /* normally from IMP */
        #define PRC_HOSTUNREACH        7      /* ditto */
        #define PRC_UNREACH_NET        8      /* no route to network */
        #define PRC_UNREACH_HOST       9      /* no route to host */
        #define PRC_UNREACH_PROTOCOL   10     /* dst says bad protocol */
        #define PRC_UNREACH_PORT       11     /* bad port # */
        #define PRC_MSGSIZE            12     /* message size forced drop */
        #define PRC_REDIRECT_NET       13     /* net routing redirect */
        #define PRC_REDIRECT_HOST      14     /* host routing redirect */
        #define PRC_TIMXCEED_INTRANS   17     /* packet lifetime expired in transit */
        #define PRC_TIMXCEED_REASS     18     /* lifetime expired on reass q */
        #define PRC_PARAMPROB          19     /* header incorrect */

while the _i_n_f_o parameter is a ``catchall''  value  which  is
request dependent.  Many of the requests have obviously been
derived from ICMP (the Internet Control  Message  Protocol),
and from error messages defined in the 1822 host/IMP conven-
tion [BBN78].   Mapping  tables  exist  to  convert  control
requests to UNIX error codes which are delivered to a user.
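
     Such a mapping table might be laid out as in the sketch
below.  The table toy_ctlerrmap, the routine toy_ctlerror, the
bound PRC_NCMDS, and above all the particular pairings of
request to error number are assumptions made for
illustration; they are not the system's actual tables.  A
zero entry stands for a request that is not reported to the
user.

```c
#include <errno.h>

#define PRC_QUENCH              4
#define PRC_HOSTUNREACH         7
#define PRC_UNREACH_NET         8
#define PRC_UNREACH_HOST        9
#define PRC_UNREACH_PROTOCOL    10
#define PRC_UNREACH_PORT        11
#define PRC_MSGSIZE             12
#define PRC_NCMDS               20  /* hypothetical: one past largest code */

/* Illustrative pairings only; unlisted requests map to 0. */
static const int toy_ctlerrmap[PRC_NCMDS] = {
    [PRC_HOSTUNREACH]      = EHOSTUNREACH,
    [PRC_UNREACH_NET]      = ENETUNREACH,
    [PRC_UNREACH_HOST]     = EHOSTUNREACH,
    [PRC_UNREACH_PROTOCOL] = ECONNREFUSED,
    [PRC_UNREACH_PORT]     = ECONNREFUSED,
    [PRC_MSGSIZE]          = EMSGSIZE,
};

/* Convert a control request to a UNIX error number, or 0 if none. */
int toy_ctlerror(int req)
{
    if (req < 0 || req >= PRC_NCMDS)
        return 0;
    return toy_ctlerrmap[req];
}
```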

_8._4.  _p_r__c_t_l_o_u_t_p_u_t

     This routine is not  currently  used  by  any  protocol
modules.


_9.  _P_r_o_t_o_c_o_l/_n_e_t_w_o_r_k-_i_n_t_e_r_f_a_c_e _i_n_t_e_r_f_a_c_e

     The lowest layer in the set of protocols which comprise
a  protocol family must interface itself to one or more net-
work interfaces in order to transmit  and  receive  packets.
It  is  assumed  that  any  routing decisions have been made
before handing a packet to a network interface; in fact, this
is  absolutely necessary in order to locate any interface at
all (unless, of course,  one  uses  a  single  ``hardwired''
interface).   There  are  two  cases to be concerned with:
transmission of a packet, and receipt of a packet.  Each will
be considered separately.

_9._1.  _P_a_c_k_e_t _t_r_a_n_s_m_i_s_s_i_o_n

     Assuming a protocol has a handle on an interface,  _i_f_p,
a  (struct  ifnet  *), it transmits a fully formatted packet
with the following call,

        error = (*ifp->if_output)(ifp, m, dst);
        int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst;

The output routine for the network interface  transmits  the
packet  _m to the _d_s_t address, or returns an error indication
(a UNIX error number).  In reality transmission may  not  be
immediate, or successful; normally the output routine simply
queues the packet on its send queue and primes an  interrupt
driven routine to actually transmit the packet.  For unreliable
media, such as the Ethernet, ``successful'' transmission
simply  means  the packet has been placed on the cable
without a collision.  On the other hand, an  1822  interface
guarantees  proper  delivery or an error indication for each
message transmitted.  The model employed in  the  networking
system  attaches  no  promises  of  delivery  to the packets
handed to a network interface,  and  thus  corresponds  more
closely to the Ethernet.  Errors returned by the output rou-
tine are  normally  trivial  in  nature  (no  buffer  space,
address format not handled, etc.).
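
     An output routine following this convention might look
like the sketch below.  Everything in it is hypothetical: the
structures toy_pkt, toy_sockaddr, and toy_ifnet, the address
family code TOY_AF_INET, and the ``active'' flag that stands
in for priming the device.  Ownership of the packet on
failure is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_AF_INET 2           /* hypothetical address family code */

struct toy_pkt { struct toy_pkt *next; };
struct toy_sockaddr { int sa_family; };

struct toy_ifnet {
    struct toy_pkt *snd_head;   /* send queue */
    struct toy_pkt *snd_tail;
    int snd_len;
    int snd_maxlen;
    int active;                 /* non-zero while the transmitter is busy */
};

/* Queue m for transmission to dst; return 0 or a UNIX error number. */
int toy_if_output(struct toy_ifnet *ifp, struct toy_pkt *m,
                  const struct toy_sockaddr *dst)
{
    assert(ifp != NULL && m != NULL && dst != NULL);
    if (dst->sa_family != TOY_AF_INET)
        return EAFNOSUPPORT;    /* address format not handled */
    if (ifp->snd_len >= ifp->snd_maxlen)
        return ENOBUFS;         /* no room on the send queue */

    m->next = NULL;
    if (ifp->snd_tail == NULL)
        ifp->snd_head = m;
    else
        ifp->snd_tail->next = m;
    ifp->snd_tail = m;
    ifp->snd_len++;

    if (!ifp->active)
        ifp->active = 1;        /* stand-in for starting the device */
    return 0;                   /* queued; delivery is not promised */
}
```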

_9._2.  _P_a_c_k_e_t _r_e_c_e_p_t_i_o_n

     Each protocol family must have  one  or  more  ``lowest
level''  protocols.   These protocols deal with internetwork
addressing and are responsible for the delivery of  incoming
packets  to  the proper protocol processing modules.  In the
PUP model [Boggs78] these protocols are termed Level 1  pro-
tocols,  in  the ISO model, network layer protocols.  In our
system each such protocol module has an input  packet  queue
assigned  to  it.   Incoming  packets  received by a network
interface are queued up for the protocol module  and  a  VAX
software interrupt is posted to initiate processing.

     Three macros are available for queueing and  dequeueing
packets,


IF_ENQUEUE(ifq, m)
     This places the packet _m at the tail of the queue _i_f_q.

IF_DEQUEUE(ifq, m)
     This places a pointer to the  packet  at  the  head  of
     queue  _i_f_q in _m.  A zero value will be returned in _m if
     the queue is empty.

IF_PREPEND(ifq, m)
     This places the packet _m at the head of the queue _i_f_q.

     Each queue has a maximum length associated with it as a
simple  form of congestion control.  The macro IF_QFULL(ifq)
returns 1 if the queue is filled, in which  case  the  macro
IF_DROP(ifq) should be used to bump a count of the number of
packets dropped, and the offending packet should be discarded.
For example, the following code fragment is commonly found in
a network interface's input routine,

        if (IF_QFULL(inq)) {
               IF_DROP(inq);
               m_freem(m);
        } else
               IF_ENQUEUE(inq, m);
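
     One way the queueing macros described above could be
written is sketched here.  The layout of toy_ifqueue, the
packet type toy_pkt, and its m_act link field are assumptions
made for the sketch; only the macro names and their described
behavior come from the text.  Each multi-statement macro is
wrapped so that it may be used as a single statement, as in
the fragment above.  Note that the macros evaluate their
arguments more than once.

```c
#include <stddef.h>

struct toy_pkt { struct toy_pkt *m_act; };  /* link to next packet */

struct toy_ifqueue {
    struct toy_pkt *ifq_head;
    struct toy_pkt *ifq_tail;
    int ifq_len;            /* packets now on the queue */
    int ifq_maxlen;         /* limit used for congestion control */
    int ifq_drops;          /* packets dropped because queue was full */
};

#define IF_QFULL(ifq)   ((ifq)->ifq_len >= (ifq)->ifq_maxlen)
#define IF_DROP(ifq)    ((ifq)->ifq_drops++)

/* Place m at the tail of the queue. */
#define IF_ENQUEUE(ifq, m) do { \
        (m)->m_act = NULL; \
        if ((ifq)->ifq_tail == NULL) \
            (ifq)->ifq_head = (m); \
        else \
            (ifq)->ifq_tail->m_act = (m); \
        (ifq)->ifq_tail = (m); \
        (ifq)->ifq_len++; \
    } while (0)

/* Place m at the head of the queue. */
#define IF_PREPEND(ifq, m) do { \
        (m)->m_act = (ifq)->ifq_head; \
        if ((ifq)->ifq_tail == NULL) \
            (ifq)->ifq_tail = (m); \
        (ifq)->ifq_head = (m); \
        (ifq)->ifq_len++; \
    } while (0)

/* Remove the head of the queue into m; m is NULL if the queue is empty. */
#define IF_DEQUEUE(ifq, m) do { \
        (m) = (ifq)->ifq_head; \
        if ((m) != NULL) { \
            if (((ifq)->ifq_head = (m)->m_act) == NULL) \
                (ifq)->ifq_tail = NULL; \
            (m)->m_act = NULL; \
            (ifq)->ifq_len--; \
        } \
    } while (0)
```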


_1_0.  _G_a_t_e_w_a_y_s _a_n_d _r_o_u_t_i_n_g _i_s_s_u_e_s

     The system has been designed with the expectation  that
it  will  be  used  in  an  internetwork  environment.   The
``canonical'' environment was envisioned to be a  collection
of  local  area  networks  connected  at  one or more points
through hosts with multiple network interfaces (one on  each
local  area  network),  and  possibly a connection to a long
haul  network  (for  example,  the  ARPANET).   In  such  an
environment,  issues of gatewaying and packet routing become
very important.  Certain of these issues, such as congestion
control, have been handled in a simplistic manner or specif-
ically not addressed.  Instead, where possible, the  network
system attempts to provide simple mechanisms upon which more
involved policies may be  implemented.   As  some  of  these
problems  become  better understood, the solutions developed
will be incorporated into the system.

     This section will describe the facilities provided  for
packet  routing.   The  simplistic  mechanisms  provided for
congestion control are described in chapter 12.

_1_0._1.  _R_o_u_t_i_n_g _t_a_b_l_e_s

     The network system maintains a set  of  routing  tables
for  selecting  a  network  interface to use in delivering a
packet to its destination.  These tables are of the form:

        struct rtentry {
               u_long    rt_hash;             /* hash key for lookups */
               struct    sockaddr rt_dst;     /* destination net or host */
               struct    sockaddr rt_gateway; /* forwarding agent */
               short     rt_flags;            /* see below */
               short     rt_refcnt;           /* no. of references to structure */
               u_long    rt_use;              /* packets sent using route */
               struct    ifnet *rt_ifp;       /* interface to give packet to */
        };


     The routing information is organized  in  two  separate
tables,  one  for  routes  to a host and one for routes to a
network.  The distinction  between  hosts  and  networks  is
necessary  so  that  a single mechanism may be used for both
broadcast and multi-drop type networks, and  also  for  networks
built from point-to-point links (e.g. DECnet [DEC80]).

     Each table is organized  as  a  hashed  set  of  linked
lists.   Two  32-bit  hash values are calculated by routines
defined for each address family; one based on  the  destina-
tion  being  a host, and one assuming the target is the net-
work portion of the address.  Each hash  value  is  used  to
locate  a  hash  chain to search (by taking the value modulo
the hash table size) and the entire  32-bit  value  is  then
used  as  a key in scanning the list of routes.  Lookups are
applied first to the routing table for hosts,  then  to  the
routing  table  for networks.  If both lookups fail, a final
lookup is made for a ``wildcard'' route (by convention, net-
work 0).  By doing this, routes to a specific host on a net-
work may be present as well as routes to the network.   This
also  allows  a ``fall back'' network route to be defined to
a ``smart'' gateway which may then perform more  intelligent
routing.
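
     The lookup order just described (host table, then
network table, then wildcard) can be sketched as follows.
The sketch makes several assumptions that are not part of the
system: addresses are bare 32-bit integers whose top 8 bits
name the network (TOY_NET), a single multiplicative hash
(toy_hash) serves both tables, and the names toy_rtentry,
toy_rtadd, and toy_rtlookup are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RTHASHSIZ 8
#define TOY_NET(a)  ((uint32_t)(a) >> 24)   /* assumed network portion */

struct toy_rtentry {
    uint32_t rt_hash;               /* full 32-bit hash, used as a key */
    uint32_t rt_dst;                /* host address or network number */
    struct toy_rtentry *rt_next;    /* next entry on the hash chain */
};

struct toy_rtentry *toy_rthost[TOY_RTHASHSIZ];  /* routes to hosts */
struct toy_rtentry *toy_rtnet[TOY_RTHASHSIZ];   /* routes to networks */

/* Hypothetical hash; maps key 0 to hash 0. */
static uint32_t toy_hash(uint32_t key)
{
    return key * 2654435761u;
}

void toy_rtadd(struct toy_rtentry **table, struct toy_rtentry *rt,
               uint32_t key)
{
    assert(table != NULL && rt != NULL);
    rt->rt_hash = toy_hash(key);
    rt->rt_dst = key;
    rt->rt_next = table[rt->rt_hash % TOY_RTHASHSIZ];
    table[rt->rt_hash % TOY_RTHASHSIZ] = rt;
}

/* Pick the chain by hash modulo table size, then scan it by full hash. */
static struct toy_rtentry *toy_scan(struct toy_rtentry **table, uint32_t key)
{
    uint32_t hash = toy_hash(key);
    struct toy_rtentry *rt;

    for (rt = table[hash % TOY_RTHASHSIZ]; rt != NULL; rt = rt->rt_next)
        if (rt->rt_hash == hash && rt->rt_dst == key)
            return rt;
    return NULL;
}

struct toy_rtentry *toy_rtlookup(uint32_t dst)
{
    struct toy_rtentry *rt;

    if ((rt = toy_scan(toy_rthost, dst)) != NULL)
        return rt;                          /* route to this host */
    if ((rt = toy_scan(toy_rtnet, TOY_NET(dst))) != NULL)
        return rt;                          /* route to its network */
    return toy_scan(toy_rtnet, 0);          /* wildcard: network 0 */
}
```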

     Each routing table entry contains a destination  (who's
at the other end of the route), a gateway to send the packet
to, and various flags which indicate the route's status  and
type  (host  or  network).  A count of the number of packets
sent using the route is kept for  use  in  deciding  between
multiple  routes  to the same destination (see below), and a
count of ``held references'' to  the  dynamically  allocated
structure is maintained to ensure memory reclamation occurs
only when the route is not in use.  Finally, a pointer to a
network interface is kept; packets sent using the route
should be handed to this interface.

     Routes are typed in two ways: either as  host  or  net-
work,  and  as ``direct'' or ``indirect''.  The host/network
distinction determines how to compare the rt_dst field during
lookup.  If the route is to a network, only a packet's
destination network is compared to the rt_dst entry stored
in the table.  If the route is to a host, the addresses must
match bit for bit.

     The distinction  between  ``direct''  and  ``indirect''
routes  indicates  whether  the destination is directly con-
nected to the source.  This is needed when performing  local
network  encapsulation.   If a packet is destined for a peer
at a host or network which is not directly connected to  the
source,  the  internetwork  packet  header will indicate the
address of the eventual destination, while the local network
header will indicate the address of the intervening gateway.
Should  the  destination  be   directly   connected,   these
addresses  are  likely to be identical, or a mapping between
the two exists.  The RTF_GATEWAY flag indicates that the route
is to an ``indirect'' gateway agent, and that the local network
header should be filled in from the rt_gateway field rather
than from rt_dst or from the internetwork destination address.

     It is assumed multiple routes to the  same  destination
will not be present unless they are deemed equal in cost
(the current routing policy process never installs  multiple
routes  to  the same destination).  However, should multiple
routes to the same destination exist, a request for a  route
will  return  the  ``least  used''  route based on the total
number of packets sent along this route.  This can result in
a  ``ping-pong''  effect (alternate packets taking alternate
routes), unless protocols ``hold onto'' routes until they no
longer find them useful;  either because the destination has
changed, or because the route is lossy.
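
     The ``ping-pong'' effect can be reproduced with a few
lines of C.  The sketch below is an illustration only; the
structure rt_count and the routines least_used and
send_packets are hypothetical names, and ties are broken in
favor of the first route found.

```c
struct rt_count {
    unsigned long rt_use;    /* packets sent using route */
};

/* Index of the least used of n routes, or -1 if there are none. */
int least_used(const struct rt_count *routes, int n)
{
    int i, best = -1;

    for (i = 0; i < n; i++)
        if (best < 0 || routes[i].rt_use < routes[best].rt_use)
            best = i;
    return best;
}

/*
 * Send count packets.  With hold == 0 a route is requested for every
 * packet; with hold != 0 the first route obtained is kept.  Returns the
 * number of times consecutive packets used different routes.
 */
int send_packets(struct rt_count *routes, int n, int count, int hold)
{
    int i, pick, cur = -1, switches = 0;

    if (n <= 0)
        return 0;
    for (i = 0; i < count; i++) {
        pick = (hold && cur >= 0) ? cur : least_used(routes, n);
        if (cur >= 0 && pick != cur)
            switches++;
        cur = pick;
        routes[cur].rt_use++;
    }
    return switches;
}
```

With two unused routes and six packets, requesting a route
per packet changes route on every packet after the first,
while holding the route never changes it.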

     Routing redirect control messages are used  to  dynami-
cally  modify  existing  routing  table  entries  as well as
dynamically create new  routing  table  entries.   On  hosts
where  exhaustive  routing  information  is too expensive to
maintain (e.g. work stations), the combination  of  wildcard
routing entries and routing redirect messages can be used to
provide a simple routing management scheme without  the  use
of a higher level policy process. Statistics are kept by the
routing table routines on the use of routing redirect messages
and their effect on the routing tables.  These statistics
may be viewed using

     Status information other than routing redirect  control
messages  may be used in the future, but at present they are
ignored.  Likewise, more intelligent ``metrics'' may be used
to   describe  routes  in  the  future,  possibly  based  on
bandwidth and monetary costs.

_1_0._2.  _R_o_u_t_i_n_g _t_a_b_l_e _i_n_t_e_r_f_a_c_e

     A protocol accesses the routing  tables  through  three
routines,  one to allocate a route, one to free a route, and
one to process a routing redirect control message.  The routine
rtalloc performs route allocation; it is called with a
pointer to the following structure,

        struct route {
               struct    rtentry *ro_rt;
               struct    sockaddr ro_dst;
        };

The route returned is assumed ``held'' by the  caller  until
disposed of with an rtfree call.  Protocols which implement
virtual circuits, such as TCP,  hold  onto  routes  for  the
duration  of  the  circuit's lifetime, while connection-less
protocols, such as UDP, currently allocate and  free  routes
on each transmission.
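
     The ``held'' reference discipline might be sketched as
below.  This is not the kernel's rtalloc or rtfree: the
names rtalloc_sketch and rtfree_sketch, the simplified
structures, the linear table, and the calling convention of
the free routine are assumptions chosen to keep the example
self-contained.

```c
#include <stddef.h>
#include <stdint.h>

struct rtentry_sketch {
    uint32_t rt_dst;       /* destination */
    short    rt_refcnt;    /* count of held references */
};

struct route_sketch {
    struct rtentry_sketch *ro_rt;    /* held route, or NULL */
    uint32_t               ro_dst;   /* destination the caller wants */
};

/* A two entry stand-in for the routing tables. */
static struct rtentry_sketch rt_table_sketch[] = {
    { 0x0A000001u, 0 },
    { 0x0A000002u, 0 },
};

/* Find a route for ro->ro_dst and take a reference on it. */
void rtalloc_sketch(struct route_sketch *ro)
{
    size_t i, n = sizeof rt_table_sketch / sizeof rt_table_sketch[0];

    if (ro->ro_rt != NULL)
        return;                      /* caller already holds a route */
    for (i = 0; i < n; i++)
        if (rt_table_sketch[i].rt_dst == ro->ro_dst) {
            rt_table_sketch[i].rt_refcnt++;
            ro->ro_rt = &rt_table_sketch[i];
            return;
        }
}

/* Drop the caller's reference; the entry may be reclaimed at zero. */
void rtfree_sketch(struct route_sketch *ro)
{
    if (ro->ro_rt == NULL)
        return;
    ro->ro_rt->rt_refcnt--;
    ro->ro_rt = NULL;
}
```

A virtual circuit protocol would call rtalloc_sketch once
when the circuit is set up and rtfree_sketch when it is torn
down; a connection-less protocol would bracket each
transmission with the pair.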

     The routine rtredirect is called to process a routing
redirect  control  message.  It is called with a destination
address and the new gateway to that destination.  If a  non-
wildcard  route exists to the destination, the gateway entry
in the route is modified to point at the  new  gateway  sup-
plied.   Otherwise,  a  new  routing table entry is inserted
reflecting the information supplied.  Routes  to  interfaces
and routes to gateways which are not directly accessible from
the host are ignored.
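
     The redirect rules above can be summarized in a short
sketch.  The table layout, the routine rtredirect_sketch and
its return codes, and the reachability test (gateways on
network 10 count as directly attached) are hypothetical; the
check for routes to interfaces is omitted for brevity.

```c
#include <stdint.h>

#define MAXRT_SKETCH 8

struct rt_redir {
    uint32_t dst;        /* destination; 0 denotes the wildcard route */
    uint32_t gateway;    /* gateway currently used for dst */
    int      inuse;      /* nonzero if this slot holds a route */
};

static struct rt_redir rtab_sketch[MAXRT_SKETCH];

/* Assumed test: only gateways on network 10 are directly accessible. */
static int directly_accessible(uint32_t gw)
{
    return (gw >> 24) == 10;
}

/* Returns 1 if a route was modified, 2 if one was created, 0 if ignored. */
int rtredirect_sketch(uint32_t dst, uint32_t gateway)
{
    int i, free_slot = -1;

    if (!directly_accessible(gateway))
        return 0;                        /* redirect is ignored */
    for (i = 0; i < MAXRT_SKETCH; i++) {
        if (!rtab_sketch[i].inuse) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* Only a non-wildcard route to dst is modified in place. */
        if (rtab_sketch[i].dst != 0 && rtab_sketch[i].dst == dst) {
            rtab_sketch[i].gateway = gateway;
            return 1;
        }
    }
    if (free_slot < 0)
        return 0;                        /* table full in this sketch */
    rtab_sketch[free_slot].dst = dst;
    rtab_sketch[free_slot].gateway = gateway;
    rtab_sketch[free_slot].inuse = 1;
    return 2;
}
```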

_1_0._3.  _U_s_e_r _l_e_v_e_l _r_o_u_t_i_n_g _p_o_l_i_c_i_e_s

     Routing policies implemented in user processes manipulate
the kernel routing tables through two ioctl calls.  The
commands SIOCADDRT and  SIOCDELRT  add  and  delete  routing
entries,  respectively;  the  tables  are  read  through the
/dev/kmem device.  The choice to place policy decisions in
a  user  process implies routing table updates may lag a bit
behind the identification of new routes, or the  failure  of
existing  routes, but this period of instability is normally
very small with proper implementation of  the  routing  pro-
cess.  Advisory information, such as ICMP error messages and
IMP diagnostic  messages,  may  be  read  from  raw  sockets
(described in the next section).

     One routing policy  process  has  already  been  imple-
mented.  The system standard ``routing daemon'' uses a vari-
ant of the Xerox NS Routing Information  Protocol  [Xerox81]
to  maintain up to date routing tables in our local environ-
ment.  Interaction with other  existing  routing  protocols,
such  as the Internet GGP (Gateway-Gateway Protocol), may be
accomplished using a similar process.


_1_1.  _R_a_w _s_o_c_k_e_t_s

     A raw socket is a mechanism which allows  users  direct
access  to a lower level protocol.  Raw sockets are intended
for knowledgeable processes which wish to take advantage  of
some  protocol  feature  not directly accessible through the
normal interface, or for the development  of  new  protocols
built  atop  existing lower level protocols.  For example, a
new version of TCP might be developed at the user  level  by
utilizing  a raw IP socket for delivery of packets.  The raw
IP socket interface attempts to provide an identical  inter-
face to the one a protocol would have if it were resident in
the kernel.

     The raw socket support is built around  a  generic  raw
socket  interface,  and  (possibly)  augmented  by protocol-
specific processing routines.  This  section  will  describe
the core of the raw socket interface.

_1_1._1.  _C_o_n_t_r_o_l _b_l_o_c_k_s

     Every raw socket has a protocol control  block  of  the
following form,

        struct rawcb {
               struct    rawcb *rcb_next;       /* doubly linked list */
               struct    rawcb *rcb_prev;
               struct    socket *rcb_socket;    /* back pointer to socket */
               struct    sockaddr rcb_faddr;    /* destination address */
               struct    sockaddr rcb_laddr;    /* socket's address */
               caddr_t   rcb_pcb;               /* protocol specific stuff */
               short     rcb_flags;
        };

All the control blocks are kept on a doubly linked list  for
performing lookups during packet dispatch.  Associations may
be recorded in the control block and used by the output rou-
tine  in  preparing packets for transmission.  The addresses
are also used to filter  packets  on  input;  this  will  be
described  in more detail shortly.  If any protocol specific
information is required, it may be attached to  the  control
block using the rcb_pcb field.

     A raw socket interface is datagram oriented.  That  is,
each  send  or  receive on the socket requires a destination
address.  This address may be supplied by the user or stored
in the control block and automatically installed in the outgoing
packet by the output routine.  Since an address field in the
control block cannot by itself show whether an address has been
stored in it, two flags, RAW_LADDR and RAW_FADDR, indicate
whether a local and a foreign address are present.  Another
flag, RAW_DONTROUTE, indicates whether routing should be
performed on outgoing packets.  If routing is performed, a
route is expected to be allocated for each ``new'' destination
address.  That is, the
first  time  a  packet is transmitted a route is determined,
and thereafter each time the destination address  stored  in
_r_c_b__r_o_u_t_e  differs  from  _r_c_b__f_a_d_d_r,  or  _r_c_b__r_o_u_t_e._r_o__r_t is
zero, the old route is discarded and a new one allocated.

_1_1._2.  _I_n_p_u_t _p_r_o_c_e_s_s_i_n_g

     Input packets are ``assigned'' to raw sockets based  on
a simple pattern matching scheme.  Each network interface or
protocol gives packets to the raw  input  routine  with  the
call:

        raw_input(m, proto, src, dst)
        struct mbuf *m; struct sockproto *proto; struct sockaddr *src, *dst;

The data packet then has a generic header prepended to it of
the form

        struct raw_header {
               struct    sockproto raw_proto;
               struct    sockaddr raw_dst;
               struct    sockaddr raw_src;
        };

and it is placed in a packet queue for the ``raw input  pro-
tocol''  module.   Packets  taken from this queue are copied
into any raw sockets that match the header according to  the
following rules,

1)   The protocol family of the socket and header agree.

2)   If the protocol number in the socket is non-zero,  then
     it agrees with that found in the packet header.

3)   If a local address  is  defined  for  the  socket,  the
     address  format of the local address is the same as the
     destination address's and the two addresses  agree  bit
     for bit.

4)   The rules of 3) are applied  to  the  socket's  foreign
     address and the packet's source address.

A basic assumption is that addresses present in the  control
block  and  packet  header  (as  constructed  by the network
interface and any raw input protocol module) are in a canon-
ical form which may be ``block compared''.
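
     The four matching rules can be written as a single
predicate.  The sketch below is illustrative: the structures
are simplified stand-ins for sockaddr, sockproto, rawcb and
raw_header, the routine name raw_match is hypothetical, and
the flag values are assumed.  The ``block compare'' is done
with memcmp.

```c
#include <string.h>

#define RAW_LADDR_SKETCH 01   /* local address present (assumed value) */
#define RAW_FADDR_SKETCH 02   /* foreign address present (assumed value) */

struct sockaddr_sketch  { unsigned short sa_family; char sa_data[14]; };
struct sockproto_sketch { short sp_family; short sp_protocol; };

struct rawcb_match {
    struct sockproto_sketch proto;   /* family and protocol of the socket */
    struct sockaddr_sketch  faddr;   /* destination address */
    struct sockaddr_sketch  laddr;   /* socket's address */
    short                   flags;
};

struct raw_header_match {
    struct sockproto_sketch raw_proto;
    struct sockaddr_sketch  raw_dst;
    struct sockaddr_sketch  raw_src;
};

/* Same address format and identical address bytes. */
static int same_addr(const struct sockaddr_sketch *a,
                     const struct sockaddr_sketch *b)
{
    return a->sa_family == b->sa_family &&
        memcmp(a->sa_data, b->sa_data, sizeof a->sa_data) == 0;
}

/* Nonzero if the packet described by rh should be copied to socket rp. */
int raw_match(const struct rawcb_match *rp, const struct raw_header_match *rh)
{
    if (rp->proto.sp_family != rh->raw_proto.sp_family)
        return 0;                                     /* rule 1 */
    if (rp->proto.sp_protocol != 0 &&
        rp->proto.sp_protocol != rh->raw_proto.sp_protocol)
        return 0;                                     /* rule 2 */
    if ((rp->flags & RAW_LADDR_SKETCH) &&
        !same_addr(&rp->laddr, &rh->raw_dst))
        return 0;                                     /* rule 3 */
    if ((rp->flags & RAW_FADDR_SKETCH) &&
        !same_addr(&rp->faddr, &rh->raw_src))
        return 0;                                     /* rule 4 */
    return 1;
}
```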

_1_1._3.  _O_u_t_p_u_t _p_r_o_c_e_s_s_i_n_g

     On output the raw pr_usrreq routine passes the packet
and raw control block to the raw protocol output routine for
any processing  required  before  it  is  delivered  to  the
appropriate  network  interface.  The output routine is nor-
mally the only code  required  to  implement  a  raw  socket
interface.


_1_2.  _B_u_f_f_e_r_i_n_g _a_n_d _c_o_n_g_e_s_t_i_o_n _c_o_n_t_r_o_l

     One of the major factors in the performance of a proto-
col  is the buffering policy used.  Lack of a proper buffer-
ing policy can force packets to be dropped, cause  falsified
windowing  information  to be emitted by protocols, fragment
host memory, degrade the overall host performance, etc.  Due
to  problems  such  as  these, most systems allocate a fixed
pool of memory to the networking system and impose a  policy
optimized for ``normal'' network operation.

     The networking system developed for UNIX is little dif-
ferent  in  this  respect.   At  boot time a fixed amount of
memory is allocated by  the  networking  system.   At  later
times  more  system  memory  may  be  requested  as the need
arises, but at no time is memory ever returned to  the  sys-
tem.  It is possible to garbage collect memory from the net-
work, but difficult.  In order to perform this garbage  col-
lection  properly,  some portion of the network will have to
be ``turned off''  as  data  structures  are  updated.   The
interval over which this occurs must be kept small compared to
the average inter-packet arrival time, or too much traffic
may be lost, impacting other hosts on the network, as well
as increasing load on the interconnecting media.  In our
environment  we have not experienced a need for such compac-
tion, and thus have left the problem unresolved.

     The mbuf structure was introduced  in  chapter  5.   In
this  section a brief description will be given of the allo-
cation mechanisms, and policies used  by  the  protocols  in
performing connection level buffering.

_1_2._1.  _M_e_m_o_r_y _m_a_n_a_g_e_m_e_n_t

     The basic memory allocation routines place no  restric-
tions  on  the  amount of space which may be allocated.  Any
request made is filled until  the  system  memory  allocator
starts  refusing  to  allocate  additional memory.  When the
current quota of memory is insufficient to satisfy  an  mbuf
allocation  request, the allocator requests enough new pages
from the system to satisfy the current  request  only.   All
memory  owned  by the network is described by a private page
table used in remapping pages to be logically contiguous  as
the  need arises.  In addition, an array of reference counts
parallels the page table and is used when multiple copies of
a page are present.

     Mbufs are 128 byte structures, 8 fitting  in  a  1Kbyte
page  of memory.  When data is placed in mbufs, if possible,
it is copied or remapped into logically contiguous pages  of
memory  from  the  network page pool.  Data smaller than the
size of a page is copied into one or more 112 byte mbuf data
areas.
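
     The sizes quoted above lead to simple arithmetic, shown
here as a sketch.  The constant and routine names are
assumptions, and the count deliberately ignores any mbufs
used only to describe the pages themselves.

```c
#define MBUF_SIZE 128    /* size of an mbuf structure, in bytes */
#define MBUF_DATA 112    /* bytes of data held in one mbuf */
#define NET_PAGE  1024   /* size of a page from the network page pool */

struct mbuf_need {
    unsigned pages;      /* whole pages copied or remapped */
    unsigned mbufs;      /* small mbufs holding the remainder */
};

/* Storage needed for len bytes of data under the policy in the text. */
struct mbuf_need mbuf_count(unsigned len)
{
    struct mbuf_need n;
    unsigned rest = len % NET_PAGE;

    n.pages = len / NET_PAGE;
    n.mbufs = (rest + MBUF_DATA - 1) / MBUF_DATA;   /* round up */
    return n;
}
```

For example, 2348 bytes of data occupy two pages plus three
mbufs, since the remaining 300 bytes do not fit in two 112
byte data areas.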


_1_2._2.  _P_r_o_t_o_c_o_l _b_u_f_f_e_r_i_n_g _p_o_l_i_c_i_e_s

     Protocols reserve fixed amounts of buffering  for  send
and  receive  queues at socket creation time.  These amounts
define the high and low water marks used by the socket  rou-
tines  in deciding when to block and unblock a process.  The
reservation of space does not currently result in any action
by the memory management routines, though it is clear if one
imposed an upper bound  on  the  total  amount  of  physical
memory  allocated  to  the  network,  reserving memory would
become important.

     Protocols which provide connection level  flow  control
do  this  based  on  the  amount  of space in the associated
socket queues.  That is, send windows are  calculated  based
on  the  amount of free space in the socket's receive queue,
while receive windows are adjusted based on  the  amount  of
data awaiting transmission in the send queue.  Care has been
taken to avoid the ``silly window  syndrome''  described  in
[Clark82] at both the sending and receiving ends.
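
     A sketch of how water marks and queue space might be
used follows.  The structure and routine names are
hypothetical, and the unblock rule shown (wake a writer once
the queue drains to the low water mark) is one plausible
reading of the text, not a statement of what the socket
routines do.

```c
struct sockbuf_sketch {
    unsigned sb_cc;       /* bytes currently in the queue */
    unsigned sb_hiwat;    /* high water mark reserved at creation */
    unsigned sb_lowat;    /* low water mark */
};

/* Free space in a queue; also the basis for a flow control window. */
unsigned sbspace_sketch(const struct sockbuf_sketch *sb)
{
    return sb->sb_cc >= sb->sb_hiwat ? 0 : sb->sb_hiwat - sb->sb_cc;
}

/* A process sending len bytes blocks if the send queue lacks room. */
int must_block(const struct sockbuf_sketch *snd, unsigned len)
{
    return len > sbspace_sketch(snd);
}

/* Assumed rule: a blocked process resumes at or below the low mark. */
int may_unblock(const struct sockbuf_sketch *snd)
{
    return snd->sb_cc <= snd->sb_lowat;
}
```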

_1_2._3.  _Q_u_e_u_e _l_i_m_i_t_i_n_g

     Incoming packets from the network are  always  received
unless  memory allocation fails.  However, each Level 1 pro-
tocol input queue has an upper bound on the queue's  length,
and  any  packets exceeding that bound are discarded.  It is
possible for a host to be overwhelmed by  excessive  network
traffic (for instance a host acting as a gateway from a high
bandwidth  network  to  a  low  bandwidth  network).   As  a
``defensive''  mechanism the queue limits may be adjusted to
throttle network traffic load on a host.   Consider  a  host
willing to devote some percentage of its machine to handling
network traffic. If the cost of handling an incoming  packet
can  be  calculated  so that an acceptable ``packet handling
rate'' can be determined, then input queue  lengths  may  be
dynamically  adjusted based on a host's network load and the
number of packets awaiting processing.  Obviously,  discard-
ing packets is not a satisfactory solution to a problem such
as this (simply dropping packets is likely to  increase  the
load  on  a  network);  the  queue lengths were incorporated
mainly as a safeguard mechanism.
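
     A bounded input queue of the kind described might look
like the following.  The structure and routine names are
assumptions for illustration; lowering maxlen at run time is
the ``defensive'' adjustment mentioned above.

```c
#include <stddef.h>

struct pkt_sketch {
    struct pkt_sketch *next;
};

struct ifqueue_sketch {
    struct pkt_sketch *head;
    struct pkt_sketch *tail;
    int len;         /* packets currently queued */
    int maxlen;      /* upper bound on queue length */
    int drops;       /* packets discarded because the queue was full */
};

/* Append p unless the bound is reached; returns 1 if queued, 0 if dropped. */
int enqueue_limited(struct ifqueue_sketch *q, struct pkt_sketch *p)
{
    if (q->len >= q->maxlen) {
        q->drops++;              /* caller is expected to free the packet */
        return 0;
    }
    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
    return 1;
}
```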

_1_2._4.  _P_a_c_k_e_t _f_o_r_w_a_r_d_i_n_g

     When packets cannot be forwarded because of memory
limitations,  the  system generates a ``source quench'' mes-
sage.  In addition, any other  problems  encountered  during
packet  forwarding  are also reflected back to the sender in
the form of ICMP packets.  This helps hosts  avoid  unneeded
retransmissions.

     Broadcast packets are never forwarded due  to  possible
dire   consequences.    In   an   early   stage  of  network
development, broadcast packets were forwarded and a  ``rout-
ing  loop'' resulted in network saturation and every host on
the network crashing.


_1_3.  _O_u_t _o_f _b_a_n_d _d_a_t_a

     Out of band data is a facility peculiar to  the  stream
socket  abstraction  defined.   Little  agreement appears to
exist as to what its semantics should be.  TCP  defines  the
notion  of  ``urgent data'' as in-line, while the NBS proto-
cols  [Burruss81]  and  numerous  others  provide  a   fully
independent  logical transmission channel along which out of
band data is to be sent.  In addition,  the  amount  of  the
data which may be sent as an out of band message varies from
protocol to protocol; everything from 1 bit to 16  bytes  or
more.

     A stream socket's notion of out of band data  has  been
defined  as  the  lowest  reasonable  common denominator (at
least reasonable in our minds); clearly this is  subject  to
debate.   Out of band data is expected to be transmitted out
of the normal sequencing and flow control constraints of the
data  stream.   A  minimum of 1 byte of out of band data and
one outstanding out of band message are expected to be  sup-
ported  by the protocol supporting a stream socket.  It is a
protocol's prerogative to support larger sized messages, or
more than one outstanding out of band message at a time.

     Out of band data is maintained by the protocol and usu-
ally not stored in the socket's send queue.  The PRU_SENDOOB
and PRU_RCVOOB requests to the pr_usrreq routine are used in
sending and receiving out of band data.


_1_4.  _T_r_a_i_l_e_r _p_r_o_t_o_c_o_l_s

     Core to core copies can be expensive.  Consequently,  a
great  deal  of  effort  was spent in minimizing such opera-
tions.   The  VAX  architecture  provides   virtual   memory
hardware  organized  in  page  units.   To  cut down on copy
operations, data is kept in page sized units on page-aligned
boundaries  whenever possible.  This allows data to be moved
in memory simply by remapping the page instead  of  copying.
The  mbuf  and network interface routines perform page table
manipulations where needed, hiding the complexities  of  the
VAX virtual memory hardware from higher level code.

     Data enters the system in two ways: from the  user,  or
from  the network (hardware interface).  When data is copied
from the user's address space into the system  it  is  depo-
sited  in  pages  (if  sufficient data is present to fill an
entire page).  This encourages the user to transmit informa-
tion  in  messages  which  are a multiple of the system page
size.

     Unfortunately, performing a similar operation when tak-
ing  data  from the network is very difficult.  Consider the
format of an incoming packet.  A packet usually  contains  a
local network header followed by one or more headers used by
the high level protocols. Finally, the data, if any, follows
these headers.  Since the header information may be variable
length, DMA'ing the eventual data for the user into  a  page
aligned  area  of  memory  is  impossible  without  a priori
knowledge of the format (e.g. supporting only a single  pro-
tocol header format).

     To allow  variable  length  header  information  to  be
present  and  still ensure page alignment of data, a special
local network encapsulation may be used.  This encapsulation,
termed a trailer protocol, places the variable length
header information after the data.  A fixed size local  net-
work  header  is then prepended to the resultant packet. The
local network header contains the size of the data  portion,
and a new trailer protocol header, inserted before the variable
length information, contains the size of the variable
length  header  information.  The following trailer protocol
header is used to store information regarding  the  variable
length protocol header:

        struct {
               short     protocol;            /* original protocol no. */
               short     length;              /* length of trailer */
        };


     The processing of the trailer protocol is very  simple.
On  output,  the  local  network  header indicates a trailer
encapsulation is being used.  The protocol  identifier  also
includes  an  indication of the number of data pages present
(before the trailer protocol header).  The trailer  protocol
header  is  initialized  to  contain the actual protocol and
variable length header size, and appended to the data  along
with the variable length header information.

     On input, the interface routines identify  the  trailer
encapsulation  by the protocol type stored in the local net-
work header, then calculate the number of pages of  data  to
find  the beginning of the trailer. The trailing information
is copied into a separate mbuf and linked to  the  front  of
the resultant packet.
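
     The output and input steps can be sketched on a flat
buffer.  This is an illustration under several assumptions:
the page size of 512 bytes, the routine names trailer_encap
and trailer_decap, and the use of host byte order are
choices made for the example, and the local network header
itself is left out.  Following the description above, the
length field holds the size of the variable length header
information.

```c
#include <string.h>

#define TRAILER_PAGE 512     /* size of one data page (assumed) */

struct trailer_hdr {
    short protocol;          /* original protocol no. */
    short length;            /* length of trailer */
};

/*
 * Output: lay out data, then the trailer header, then the variable
 * length header.  Returns the total length, or 0 if the data is not
 * a whole number of pages.
 */
size_t trailer_encap(unsigned char *out, const unsigned char *hdr,
                     size_t hlen, const unsigned char *data, size_t dlen,
                     short proto)
{
    struct trailer_hdr th;

    if (dlen == 0 || dlen % TRAILER_PAGE != 0)
        return 0;
    th.protocol = proto;
    th.length = (short)hlen;
    memcpy(out, data, dlen);                      /* data stays page aligned */
    memcpy(out + dlen, &th, sizeof th);
    memcpy(out + dlen + sizeof th, hdr, hlen);
    return dlen + sizeof th + hlen;
}

/*
 * Input: npages comes from the local network header.  Copies the
 * variable length header into hdr_out (at most hdr_max bytes) and
 * returns its length, or 0 if it does not fit.
 */
size_t trailer_decap(const unsigned char *in, unsigned npages,
                     unsigned char *hdr_out, size_t hdr_max, short *proto)
{
    struct trailer_hdr th;
    const unsigned char *t = in + (size_t)npages * TRAILER_PAGE;

    memcpy(&th, t, sizeof th);
    if (th.length < 0 || (size_t)th.length > hdr_max)
        return 0;
    *proto = th.protocol;
    memcpy(hdr_out, t + sizeof th, (size_t)th.length);
    return (size_t)th.length;
}
```

After trailer_decap the data still sits at the start of the
receive buffer, so only the short header had to be copied.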

     Clearly, trailer protocols require cooperation  between
source and destination.  In addition, they are normally cost
effective only when sizable packets are used.   The  current
scheme  works because the local network encapsulation header
is a fixed size, allowing DMA operations to be performed  at
a  known  offset  from  the  first data page being received.
Should the local network  header  be  variable  length  this
scheme fails.

     Statistics collected indicate as much as 200Kb/s can be
gained by using a trailer protocol with 1Kbyte packets.  The
average size of the variable length header was 40 bytes (the
size  of  a minimal TCP/IP packet header).  If hardware sup-
ports larger sized packets, even greater gains may be  real-
ized.


_A_c_k_n_o_w_l_e_d_g_e_m_e_n_t_s

     The internal structure of the system is patterned after
the  Xerox  PUP  architecture  [Boggs79],  while  in certain
places the Internet protocol family has had a great deal  of
influence in the design.  The use of software interrupts for
process invocation is based on similar facilities  found  in
the VMS operating system.  Many of the ideas related to pro-
tocol modularity, memory management, and network  interfaces
are  based  on  Rob  Gurwitz's TCP/IP implementation for the
4.1BSD version of UNIX on the VAX [Gurwitz81].  Greg Chesson
explained  his  use  of  trailer  encapsulations in Datakit,
instigating their use in our system.


_R_e_f_e_r_e_n_c_e_s


[Boggs79]           Boggs, D. R., J. F. Shoch, E. A. Taft,
                    and R. M. Metcalfe; PUP: An Internetwork
                    Architecture.  Report CSL-79-10.  XEROX
                    Palo Alto Research Center, July 1979.

[BBN78]             Bolt Beranek and Newman; Specification
                    for the Interconnection of Host and IMP.
                    BBN Technical Report 1822.  May 1978.

[Cerf78]            Cerf, V.  G.;   The  Catenet  Model  for
                    Internetworking.     Internet    Working
                    Group, IEN 48.  July 1978.

[Clark82]           Clark, D. D.;  Window  and  Acknowledge-
                    ment  Strategy  in TCP. Internet Working
                    Group, IEN Draft Clark-2.  March 1982.

[DEC80]             Digital Equipment Corporation; DECnet
                    DIGITAL Network Architecture - General
                    Description.  Order No. AA-K179A-TK.
                    October 1980.

[Gurwitz81]         Gurwitz,  R.  F.;   VAX-UNIX  Networking
                    Support    Project    -   Implementation
                    Description.     Internetwork    Working
                    Group, IEN 168.  January 1981.

[ISO81]             International Organization for
                    Standardization.  ISO Open Systems
                    Interconnection - Basic Reference Model.
                    ISO/TC 97/SC 16 N 719.  August 1981.

[Joy82a]            Joy, W.; Cooper, E.; Fabry, R.; Leffler,
                    S.; and McKusick, M.; 4.2BSD System
                    Manual.  Computer Systems Research
                    Group, Technical Report 5.  University
                    of California, Berkeley.  Draft of
                    September 1, 1982.

[Postel79]          Postel, J., ed.  DOD Standard User
                    Datagram Protocol.  Internet Working
                    Group, IEN 88.  May 1979.

[Postel80a]         Postel, J., ed.  DOD Standard Internet
                    Protocol.  Internet Working Group, IEN
                    128.  January 1980.

[Postel80b]         Postel, J., ed.  DOD Standard Transmission
                    Control Protocol.  Internet Working
                    Group, IEN 129.  January 1980.

[Xerox81]           Xerox Corporation.  Internet Transport
                    Protocols.  Xerox System Integration
                    Standard 028112.  December 1981.

[Zimmermann80]      Zimmermann, H.  OSI  Reference  Model  -
                    The  ISO  Model of Architecture for Open
                    Systems Interconnection.  IEEE  Transac-
                    tions   on  Communications.   Com-28(4);
                    425-432.  April 1980.