Athena TM 4    D R A F T    27 April 1984

7 The namespace protocol

Two different kinds of interchanges take place in locating a file and establishing a connection to it. The requests to the namespace-name server and to the namespace manager are small, and require only a single request message and a single reply message. If a reply is not forthcoming, the request can be repeated. The exchanges between a user program and an open file, however, may require multiple packets, which must be delivered reliably and in order.

I think this division indicates the underlying protocols to be used for the exchanges: requests to the namespace-name server and replies from the namespace-name server are UDP datagrams. The initial pathname request to the namespace manager and the manager's reply are also carried by UDP datagrams. A file is opened by opening a full-duplex TCP connection to the agent (the channel structure is a specification for the TCP connection), and descriptor-oriented requests and their replies travel through the TCP connection. Thus pathname-oriented system calls happen by means of UDP datagrams, and descriptor-oriented system calls happen by means of a TCP connection. In the context of remote access, a descriptor and a TCP connection may be considered synonymous.

Recall that when looking for an object, the namespace manager may return a descriptor for accessing the file, or it may return another file name to use in continuing the search for the object. As an optimization for those cases where the namespace manager would return a remote access descriptor which would then be used to perform a single operation (e.g., access, chmod), we piggyback the operation request on top of the initial request to locate the object, so only one message to a namespace manager is necessary. In that case the namespace manager returns a positive result or a negative result.
In fact, open is the only system call which does not follow this pattern, and is thus the only system call which may actually return a remote access descriptor. Another way of saying this is that open is the only system call which turns a pathname into a descriptor.

7.0.1 Datagram uniqueness and ordering

It is important that datagram-based requests arrive in order (imagine the consequences of an unlink requested before a create but arriving after the create is performed) and that repeated packets be ignored. To keep requests properly sequenced, the user sends a request, then waits for the reply to that request before sending another. To allow requests that were repeated because a reply was not forthcoming to be discarded, each request packet contains a process-ID and a sequence number. The only thing that is guaranteed about the sequence number is that during any particular five-minute period the sequence number for a given process-ID increases monotonically (modulo 2^32), not that it increases by increments of one. This allows the user program to use a single sequence number for all the datagram requests it sends to all the hosts with which it corresponds.

After five minutes the server may forget the last sequence number it received from a particular process-ID. This is probably sufficient to allow a host to crash and restart, and to allow new processes (coincidentally using process-IDs which were in use before the crash) to communicate with old servers which might otherwise remember the old process-ID's sequence numbers. Therefore, no provision need be made to remember process-IDs or sequence numbers across crashes -- the sequence number knowledge will time out and be forgotten. [18]
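
The server side of the duplicate-detection rule might be sketched as follows. This is only an illustration under stated assumptions, not part of the protocol definition: the names dedup_table, dedup_init and dedup_accept, the fixed table size, and the choice to refuse new process-IDs when the table is full are all hypothetical. The caller supplies the current time in seconds so that the five-minute forgetting rule can be exercised directly. "Increases modulo 2^32" is taken to mean that a new sequence number is ahead of the remembered one by less than 2^31.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEDUP_SLOTS  64    /* hypothetical table size */
#define FORGET_AFTER 300   /* five minutes, in seconds */

struct dedup_entry {
    int      in_use;
    uint32_t process_id;
    uint32_t last_seq;     /* last sequence number accepted */
    long     last_seen;    /* time of that request, in seconds */
};

struct dedup_table {
    struct dedup_entry slot[DEDUP_SLOTS];
};

/* Mark every slot as free. */
void dedup_init(struct dedup_table *t)
{
    memset(t, 0, sizeof *t);
}

/*
 * Return 1 if the request should be performed, 0 if it should be
 * discarded as a repeat (or as older than the last one accepted).
 */
int dedup_accept(struct dedup_table *t, uint32_t pid, uint32_t seq, long now)
{
    struct dedup_entry *free_slot = NULL;
    size_t i;

    for (i = 0; i < DEDUP_SLOTS; i++) {
        struct dedup_entry *e = &t->slot[i];

        /* Forget process-IDs not heard from for five minutes. */
        if (e->in_use && now - e->last_seen >= FORGET_AFTER)
            e->in_use = 0;

        if (e->in_use && e->process_id == pid) {
            /* Unsigned subtraction wraps modulo 2^32. */
            uint32_t ahead = seq - e->last_seq;

            if (ahead == 0 || ahead >= 0x80000000u)
                return 0;          /* repeat or stale request */
            e->last_seq = seq;
            e->last_seen = now;
            return 1;
        }
        if (!e->in_use && free_slot == NULL)
            free_slot = e;
    }

    if (free_slot == NULL)
        return 0;                  /* table full: this sketch refuses */
    free_slot->in_use = 1;
    free_slot->process_id = pid;
    free_slot->last_seq = seq;
    free_slot->last_seen = now;
    return 1;
}
```

Because only the order of sequence numbers is examined, the user end remains free to draw the numbers for all of its correspondents from a single counter, as described above.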

_______________
18 There is a slight problem with this scheme: it assumes processes whose active lives are short relative to the amount of time a host is up. Many daemons are started as soon as the host comes up and are active throughout a host's lifetime. In UNIX, since daemon processes are all started up in the same order by /etc/rc every time, they also always come up with the same UNIX process-ID. Therefore using UNIX process IDs as process-IDs will have some problems (which I think are solvable, but the means of solving this problem which come to mind are too ugly for me to want to describe them here).

7.1 Data structures and packet formats

7.1.1 General request packet format

A request from a user to an object manager (or to the namespace-name server) is a packet encapsulated in a UDP datagram (or may be transmitted across a TCP connection to the server). It consists of a header and data. The packet format is:

    struct request_packet {
        char protocol_version;
        char reserved;
        short request_type;
        long uid;
        long nonce;
        long hint;
        long process_id;
        long sequence_number;
        union {
            struct {
                short name_length;
                char name[ARBITRARY_LENGTH];
            } namespace_name;
            struct argstruct request_arguments;
        } requ;
    };

The structure elements are:

- protocol_version -- The protocol version number is included to allow us to change the protocol in the future, and to detect attempts to use older versions (if not to actually remain compatible with older versions).
- request_type -- The request type is a constant defining which system call this packet represents, or whether the packet is a request to a namespace manager.
- uid -- A 32-bit quantity which identifies the user on whose behalf the request is being made (see the sections on authentication).
- nonce -- A nonce is a number which is interpreted only by the user machine, and is ignored by the server. The server simply returns the nonce unchanged in its reply packet.
- hint -- The hint is a number interpreted only by the server end. It may be 0 (and will be in all pathname-oriented requests), or it is the number which the namespace manager returned in the rpc_channel reply when returning a descriptor for the file. The user end of the exchange puts no interpretation on the hint, and simply passes it back to the server. [19]
- process_id and sequence_number -- These elements provide a means of detecting and discarding duplicate packets. For a description of the mechanism used, see the section "Datagram uniqueness and ordering".
- request_arguments -- The arguments to the system call. Their format varies from request type to request type, and is described below.
- namespace_name -- The namespace name is passed as a string length followed by a string of bytes (in VAX, PDP11, and NS16032 ordering of bytes in words).

_______________
19 I owe the idea of the nonce and the hint to Mike Greenwald's description of the MIT-LCS Remote Virtual Disk protocol (Greenwald, personal communication).

7.1.2 General reply packet format

The server's reply to a user's request may also be encapsulated in a datagram or shipped across a TCP connection. The packet has the following structure:

    struct reply_packet {
        char protocol_version;
        char reserved;
        short request_type;
        long nonce;
        long hint;
        long process_id;
        long sequence_number;
        struct rpc_reply rpc_reply;
    };

The elements of this structure are:

- protocol_version -- Means the same thing as in the request header.
- request_type, nonce, process_id, sequence_number -- These are all copied from the header of the request to which this is a reply.
- hint -- The user end will save this hint for use in future requests.

7.1.3 The structure a pathname remote procedure call returns

The prpc routine, which implements the PATHNAME object, and the drpc routine, which implements the DESCRIPTOR object, return an rpc_reply structure embedded in a reply_packet structure. The rpc_reply structure looks like this:

    struct rpc_reply {
        char reply_type;
        char reserved;
        short length_of_data;
        union {
            short result;
            long long_result;
            short errno;
            char string[ARBITRARY_LENGTH];
            struct rpc_channel channel;
        } data;
    };

This is composed of the following parts:

- reply_type is one of RPC_CHANNEL, PATHNAME, DATA, POSITIVE_RESULT, or NEGATIVE_RESULT, and is used to tell the calling routine how to interpret the union. [drpc will never return RPC_CHANNEL or PATHNAME, since they are only involved in locating the file given a pathname.]
- length_of_data is the length of the structure contained in the following union. For example, if the type of the reply is PATHNAME, then length_of_data is the length of the pathname in bytes; if the type of the reply is RPC_CHANNEL, then length_of_data is sizeof(struct rpc_channel).
- result is the integer result that the system call returns (most system calls return integers of one sort or another).
- long_result is provided for those system calls (lseek) which return a long rather than an int, and is not strictly necessary in a 32-bit implementation of UNIX. [20]
- errno is the error number to store in the global variable errno. This interpretation is used only when the type of the reply is NEGATIVE_RESULT.
- string is used for PATHNAME returns, and can be any length up to the maximum length of a pathname, which is currently 1024 characters. string is also used to provide a buffer in which to return the results of any system call (e.g., read, stat) which returns a buffer full of data.
- channel is a channel description used to access the file (currently, open is the only system call that returns one of these). The rpc_channel structure is described in the section "The remote access channel structure".

_______________
20 As a former PDP-11 UNIX programmer I view all the hidden "an int is 32 bits long" assumptions that appear in the 4.2BSD sources with extreme distaste. These are _bad_ programming practices, and the people at Berkeley should know better, especially since they had to _remove_ things which made the assumptions explicit that were present in the original Version 7 sources. Shame, shame, shame.

7.1.4 Subroutine arguments

Each argument is passed in the argument buffer as a byte-stream containing the data that forms the argument, preceded by a 16-bit length-of-data field. Argument lengths always begin on an even-byte boundary, so an argument of odd length is followed by one pad byte (shown as "-"). For example, the arguments "Chris" and "Joe" are passed as:

    05      "Ch"    "ri"    "s"-    03      "Jo"    "e"-
    0x5     0x4368  0x7269  0x7300  0x3     0x4a6f  0x6500

(this is VAX, PDP11, and NS16032 order).
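
The packing rule above can be sketched as a small helper. This is a minimal illustration, not the memo's actual routine: the name put_arg and its calling convention are hypothetical. It also assumes that "VAX order" means the 16-bit length is stored low-order byte first, that the characters follow in string order, and that the pad byte is zero (as the 0x7300 and 0x6500 words in the example suggest).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append one argument to buf, starting at offset off (which should
 * be even).  Returns the offset at which the next argument begins,
 * or 0 if the argument does not fit in the cap bytes available.
 */
size_t put_arg(unsigned char *buf, size_t cap, size_t off,
               const void *data, uint16_t len)
{
    /* Round the data length up to an even number of bytes. */
    size_t padded = (size_t)len + (len & 1u);

    if (off > cap || cap - off < 2 + padded)
        return 0;

    /* ASSUMPTION: 16-bit length, low-order byte first (VAX order). */
    buf[off]     = (unsigned char)(len & 0xff);
    buf[off + 1] = (unsigned char)(len >> 8);

    memcpy(buf + off + 2, data, len);

    /* ASSUMPTION: an odd-length argument is padded with a zero byte. */
    if (len & 1u)
        buf[off + 2 + len] = 0;

    return off + 2 + padded;
}
```

Calling put_arg twice, first with "Chris" (length 5) at offset 0 and then with "Joe" (length 3) at the offset the first call returned, fills 14 bytes: each argument occupies its two-byte length, its characters, and one pad byte.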

A number of unions make it easier to access most system calls from C routines (although note that the order of the arguments here is _not_ the same as if these were real remote procedure calls).

    struct argstruct {
        union {
            struct mode_struct {
                short mode_len;
                short mode;
                short name_len;
                char name[filename_len];
            } modes;
            struct open_struct {
                short flag_len;
                short flags;
                short mode_len;
                short modes;
                short name_len;
                char name[filename_len];
            } open;
            struct name_struct {
                short name_len;
                char name[filename_len];
            } name;
        } argu;
    };

Each of the structures defined in the argu union is an appropriate form for the arguments of different system calls to take. For an example of the details of one of these structures, see the definition of the prpc routine.

7.1.5 Remote access descriptors

Analogous to file descriptors, each process has a table of remote access descriptors. A remote access descriptor is a small integer (it fits the same description as a file descriptor) which serves as an index into a table of remote access file structures:

    struct ral_file {
        char type;
        char reserved;
        short refct;
        short real_fd;
        struct rpc_channel channel;
    } *ralfd_tab[NOFILES];

type is one of LOCAL or REMOTE. refct is a reference count (since multiple virtual file descriptors can point to a single remote access descriptor, by grace of dup and dup2).

There are a couple of routines provided for allocating and freeing remote access descriptors.

    /* find an unallocated ralfd_tab entry */
    ral_file_alloc()
    {
        register int i;

        for(i = 0; i < NOFILES; i++)
            if(ralfd_tab[i] == NULL) {
                ralfd_tab[i] = (struct ral_file *) malloc(sizeof(struct ral_file));
                if(ralfd_tab[i] == NULL)
                    return(-1);    /* out of memory */
                ralfd_tab[i]->refct = 0;
                return(i);
            }
        /* too many files are already open */
        return(-1);
    }

    /* free an ralfd_tab entry once its last reference is dropped */
    ral_file_free(fd)
    {
        if(ralfd_tab[fd]->refct-- <= 0) {
            free((char *) ralfd_tab[fd]);
            ralfd_tab[fd] = NULL;    /* make the slot available again */
        }
    }

7.1.6 The remote access channel structure

The namespace manager returns one of three things: an error, another pathname to use, or a channel for communicating with the file. The channel is a structure which looks like:

    struct rpc_channel {
        long ip_addr;
        short port;
        short hint;
    };

The ip_addr field is the IP address of the host on which the manager for this particular object resides. The port is the UDP port to which to send requests for operations, and the hint is a 16-bit field which must be included with all operation requests, and is interpreted solely by the object manager.

7.2 Remote procedure calls

7.2.1 The pathname remote procedure call routine

The prpc routine takes a varying number of arguments, depending on which system call is being invoked remotely, and packages a remote procedure call request for the remote system. If the namespace manager returns a new pathname to use, prpc checks the name to see if it is local. If the new pathname _is_ local, then prpc returns the new pathname to the calling routine; otherwise it locates the new namespace manager and invokes the operation on the new pathname.

The prpc routine looks something like this:

    struct rpc_reply *
    prpc(system_call, arg1, arg2, arg3, arg4, arg5, arg6, arg7)
    {
        struct request_packet request;

        /*
         * Construct a request packet for this call.
         * First, the invariant part.
         */
        request.protocol_version = PROTOCOL_VERSION;
        request.request_type = system_call;
        request.nonce = make_nonce();
        request.uid = getuid();
        request.hint = 0;    /* always 0 in pathname-oriented requests */
        request.process_id = ral_pid;
        request.sequence_number = make_sequence_number();

        /*
         * Now the variable part.
         */
        switch(system_call) {
        case ACCESS:
        case CHMOD:
        case CHOWN:
        case MKDIR:
            return prpc_path_mode(&request, arg1, arg2);
        case LINK:
        case SYMLINK:
            return prpc_opath_npath(&request, arg1, arg2);
        case OPEN:
            return prpc_open(&request, arg1, arg2, arg3);
        case RMDIR:
        case UNLINK:
            return prpc_path(&request, arg1);
        case READLINK:
            return prpc_path_buf_buflen(&request, arg1, arg2, arg3);
        case STAT:
            return prpc_path_buf_buflen(&request, arg1, arg2, sizeof(struct stat));
        }
    }

Here is a sketch of a possible implementation of the routine which implements access, chmod, chown, and mkdir. The steps written out in English are left as pseudocode.

    struct rpc_reply *
    prpc_path_mode(request, pathname, mode)
    struct request_packet *request;
    char *pathname;
    {
        char *filename;
        char *namespace_name;
        struct rpc_reply *namespace_reply, *reply;
        struct rpc_channel channel;
        struct mode_struct *args = &request->requ.request_arguments.argu.modes;

        args->mode_len = sizeof(args->mode);
        args->mode = mode;

    path_and_mode_name_loop:
        namespace_name = get_namespace_name(pathname);
        filename = get_filename(pathname);
        namespace_reply = namespace_nameserver(namespace_name);
        switch(namespace_reply->reply_type) {
        case NEGATIVE_RESULT:
            garbage collect all allocated structures;
            return an error indicating a non-existent file;
        case RPC_CHANNEL:
            break;
        default:
            garbage collect allocated structures;
            return an indication of an illegal reply from the
                namespace server;
        }

        /*
         * Now we know how to get in touch with the namespace
         * manager.
         */
        channel = namespace_reply->data.channel;
        args->name_len = strlen(filename);
        copy the string at filename into args->name
            (this probably involves allocating a buffer for it);
        reply = send_to_channel(&channel, request);
        if(reply->reply_type == PATHNAME) {
            if(is_local_filename(reply->data.string)) {
                garbage collect allocated structures;
                return(reply);
            } else {
                /* continue the search with the pathname we were handed */
                pathname = reply->data.string;
                goto path_and_mode_name_loop;
            }
        } else {
            garbage collect allocated structures;
            return(reply);
        }
    }

The prpc routine returns a pointer to an allocated rpc_reply structure. The calling routine must explicitly garbage collect the structure when it is done with it.

7.2.2 Descriptor remote procedure calls

The DESCRIPTOR object is implemented by the drpc (descriptor remote procedure call) routine, which is largely analogous to the prpc routine. The drpc routine also returns a pointer to an rpc_reply structure, and it looks like this:

    struct rpc_reply *
    drpc(system_call, descriptor, arg1, arg2, arg3, arg4, arg5)
    {
        struct request_packet request;

        /*
         * Set up the request packet for this call.
         */
        request.protocol_version = PROTOCOL_VERSION;
        request.request_type = system_call;
        request.nonce = make_nonce();
        request.uid = getuid();
        /* pass back the hint the manager gave us with the channel */
        request.hint = ralfd_tab[descriptor]->channel.hint;
        request.process_id = ral_pid;
        request.sequence_number = make_sequence_number();

        switch(system_call) {
        case FCHMOD:
        case FCHOWN:
        case FTRUNCATE:
            return drpc_long(&request, ralfd_tab[descriptor], arg1);
        case CLOSE:
        case FSYNC:
            return drpc_none(&request, ralfd_tab[descriptor]);
        case LSEEK:
            return drpc_long_long(&request, ralfd_tab[descriptor], arg1, arg2);
        case READ:
            return drpc_read(&request, ralfd_tab[descriptor], arg1, arg2);
        case WRITE:
            return drpc_write(&request, ralfd_tab[descriptor], arg1, arg2);
        case FSTAT:
            return drpc_fstat(&request, ralfd_tab[descriptor], arg1);
        }
    }

Note that the drpc routine will never return a PATHNAME, since the location of the object has already been determined.

Table of Contents

7 The namespace protocol
    7.0.1 Datagram uniqueness and ordering
    7.1 Data structures and packet formats
        7.1.1 General request packet format
        7.1.2 General reply packet format
        7.1.3 The structure a pathname remote procedure call returns
        7.1.4 Subroutine arguments
        7.1.5 Remote access descriptors
        7.1.6 The remote access channel structure
    7.2 Remote procedure calls
        7.2.1 The pathname remote procedure call routine
        7.2.2 Descriptor remote procedure calls