Project Athena Technical Memorandum Number 4

Namespaces and databases: steps toward implementing a distributed file system for Athena

Dave Mankins, BBN

20 November 1984

Abstract

An overview of the steps toward implementing a distributed filesystem similar to the one described in Athena Technical Memorandum Number 1 is presented, along with a detailed presentation of the first step. An argument is made for thinking of a filesystem as a database, and of the name of a file as a key. A mechanism for accessing objects in different database or filesystem servers from within the Athena system is presented. This mechanism relies on the concept of namespaces and namespace managers, which are located with the cooperation of namespace-name servers. Network transparency is thrown to the winds.

NOTE: this is an incomplete draft.

1 Introduction

The Locus-like [1] distributed filesystem described in Athena Technical Memorandum Number 1, _Some thoughts concerning a distributed filesystem for Athena_ (TM#1), represents a very ambitious project which depends on many results that are still the subject of research. Project Athena needs a reliable (as in "non-buggy") distributed filesystem _soon_; the system described in TM#1, while desirable, is likely to require several years to implement. So, how do we get there from here, and have something useful in the meantime? The answer is: "In small steps." I intend to describe in this memorandum a series of small steps toward the ultimate goal, many of which will be useful in and of themselves.

[1] "Locus" is a trademark of Locus Computing Corporation.

2 Goals, and problems with TM#1

The goals of a distributed filesystem for Athena, as described in TM#1, are:

1.
The system has to be designed with the awareness that it will eventually encompass several thousand machines, with machines coming on-line or going off-line at any time. Therefore, the namespace must be flexible, and it would be nice if the user didn't have to "mount" every filegroup before using it. [2]

2. We want resources to be both widely and reliably available.

3. The distribution of the filesystem should be _transparent_. That is, the user should not be concerned with the actual location of a file, and the location of the file should not be built into the file's name. An attractive model is that the Athena system represents one virtual machine. [3]

4. Since a distributed filesystem will not appear tomorrow, and since Project Athena has elected to use the 4.2BSD version of the UNIX operating system [4] as a basis, in order to take advantage of the large body of existing software, we would like the existing file operations (e.g., open, close, read, write, create, link, flock, and stat) to work in a backwards-compatible way. Therefore, we want our system's file operations to be a superset of the existing UNIX operations.

[2] Why not mount filegroups (that is, attach filegroups to nodes in the local filesystem tree)? If we have a location-dependent filesystem (which I think we will have in the early stages of development) then there will be thousands of such nodes. We must have the filegroups mounted (or at least have a place for them to be attached), because when a user begins a session they have little idea which filegroups they will use (also, many programs use a number of files of which the user is never informed, except when they aren't available). Therefore, we want our system to go out and find filegroups when they aren't mounted.

[3] While this is a desirable goal, I think it is also expendable. In the early stages of development I intend to sacrifice this goal in order to construct a system that will allow resource sharing, as described below.

[4] UNIX is a trademark of Bell Laboratories, Western Electric, or ATT, or someone.

An important goal which was largely overlooked in TM#1 is encouraging people to share resources and tools. We want to make it easy for programs to access files and databases throughout the network. As pointed out in TM#2, the distinction between the personal computers on the Athena system and the file servers was ignored in TM#1. Athena's personal computers, at least in the early stages of the project, are going to be public resources: they will sit in libraries, classrooms, and computer lounges. You will use one machine one day, and a different machine the next day. Storage allocation rules could have files created by processes running on a computer be stored on the same computer. This is not a good idea if you're not likely to be using that particular computer the next day.

While it is true that the personal computers will have local storage, it is not clear that such storage should be fully integrated into the Athena distributed filesystem. Part of the advantage of a personal workstation is that you get to use all of it while you are using it (it isn't shared). This advantage is diminished if your personal workstation is storing someone else's files and is busy serving their requests for the data in those files. Therefore we are likely to make some distinction between a workstation and a server machine.

3 Implementing multiple namespaces

I intend to distribute the filesystem in a series of steps. The glue that holds these separate steps together, and which keeps them compatible, is a new view of the division of the name space. I think a good approach is to provide _several_ namespaces, each specified by a different name.
I view the namespace designator as being separate from the UNIX filename, and possibly taking this form:

    /@namespace-designator/filename

In the current treatment of UNIX filenames, the namespace is implicit: the namespace is the filesystem of the local machine. This model is preserved by systems such as the Newcastle Connection and Locus. If we move to a Locus-like filesystem, then the implicit namespace will be the local distributed Locus filesystem.

The namespace designator is not a filename; it is a name to be submitted to a name-server. The name-server will tell the system how to get in touch with the namespace by that name. Many namespaces may exist in parallel (and may even overlap). For example, during early stages of development of a distributed filesystem, the namespace designator could designate a particular machine, e.g.,

    /@mit-pygmalion/usr/dm/dfs/doc/plan.mss

and at later stages the namespace designator could designate a Locus-like filesystem, or some non-UNIX filesystem, or even something that is more obviously a database manager and less obviously what is traditionally described as a filesystem. [5] All of these systems could coexist on the network at the same time. A point implicit in my thinking is that the local machine's namei routine figures out what namespace it is concerned with and then hands the filename off to the namespace manager.

[5] For example, a relational database server may define some subset of the filesystem operations on its objects, allowing a program to:

    open("/@rdb/the list of people in my dorm taking 24.05", READ)

The reasons for making the namespace name syntactically distinctive, instead of making the namespaces be fake entries in the root directory (or some other directory), are:

1. _Allowing namespaces to come and go without changing thousands of filesystems_. If many namespace names are _also_ machine names, then adding a machine to the network will require a lot of work updating everyone's filesystems. [6]

2. _Allowing users to have their own private set of namespaces_. We may want to define a mechanism by which users can define their own namespace-name servers (probably in addition to the system's namespace-name server). This wouldn't work if the namespace names appear as special files in the filesystem.

3. _Saving a trip to the filesystem_. If the namespace designator is indistinguishable from a filename, then, in order to distinguish between a namespace designation and a file (or directory) designation, the local filesystem has to be consulted. It is likely to be impractical to list all of the namespaces in every machine's filesystem. Therefore, if namespace names look exactly like filenames, then the failure to locate a file in the filesystem might indicate that the name designates a namespace. In that case the system must send an inquiry to the namespace-name server. Usually one fails to locate files in the local filesystem because one has mistyped a filename, and it seems wasteful to consult the namespace name-server in such cases.

4. _Allowing filenames and namespace names to overlap_. If namespace names look just like filenames, things will get confusing. If you choose to have a file which has the same name as a namespace, then probably the namespace will be inaccessible. This is a problem because the knowledge of what namespaces actually exist may be widely distributed, and it may not be easy for the user to discover that they have created a duplicate name.
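As a concrete illustration of the hand-off described above, here is a minimal sketch of how a pathname might be split into a namespace designator and a residual filename to pass to the namespace manager. The routine name split_namespace and its interface are hypothetical, not part of any existing namei:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: split "/@namespace/rest" into its namespace
 * designator and the residual filename handed to the namespace manager.
 * Returns 1 if the path carries a namespace designator, 0 if it is an
 * ordinary local pathname (the implicit local namespace). */
int split_namespace(const char *path, char *ns, size_t nslen,
                    const char **residual)
{
    if (path[0] != '/' || path[1] != '@')
        return 0;                      /* implicit local namespace */
    const char *start = path + 2;
    const char *slash = strchr(start, '/');
    size_t len = slash ? (size_t)(slash - start) : strlen(start);
    if (len + 1 > nslen)
        return 0;                      /* designator too long for caller */
    memcpy(ns, start, len);
    ns[len] = '\0';
    *residual = slash ? slash : "";    /* "" means the namespace root */
    return 1;
}
```

A local pathname, lacking the "/@" prefix, falls through untouched, which is what makes the scheme backwards compatible.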
"/@" seems like a good way to indicate "what follows is a namespace designator", as the '@' (at-sign) carries an appropriate semantic meaning (that the file resides _at_ the namespace), and it seems unlikely that we will be stuck with some software package that depends on the ability to make filenames that begin with @ _in the root directory_. [7]

[6] Machines could be hierarchically arranged, which would require that only a few filesystems be changed to add a new machine.

[7] The "/@" notation is due to Dan Franklin of Interactive Systems, Inc. The concern for portability of this system extends to systems running UNIX System V, an environment where commercial applications are forced to squeeze as much information into 14 characters (the maximum length of a filename in System V and earlier versions of UNIX) as possible. Prohibiting @ from being the first character of a file name in the root directory does not seem particularly restrictive.

3.1 Possible advanced uses of namespace managers

An important thing to note is that this scheme defines a protocol for manipulating objects with the UNIX file operations. Reading and writing are useful things to be able to do to all kinds of objects. Possible applications I can see for this scheme are:

- A ~ namespace manager. By special-casing '~' to be synonymous with '/@~/' one can make the C Shell's home-directory locating mechanism available from within any program without writing special code anywhere (except for the namespace manager). Any program may open '~dm', as the namespace manager would return the real pathname to use as part of the standard namespace manager protocol.

- Logical names. A logical-name server can be provided, allowing users to define logical names for files which may be interpreted by any program, without the program having to be specially modified to understand the notion of logical names.

- System databases. For example, one could write a namespace manager to manage the namespace of user mailboxes, replacing the /usr/lib/aliases file and the subroutines for parsing it by just having the mail-sending program open /@mailbox/dm for writing.

- Generalized objects. For example, a mailbox object could be implemented by a program which defines read and write operations on a mailbox. These operations may behave differently depending on who attempts them (for example, people may be allowed to read the messages they have sent to other people's mailboxes). Arguably, this may be what the vaguely-defined concept of "portals" in the 4.2BSD system manual is.

- Non-kernel device drivers. Devices are "special files" in the UNIX system. This mechanism will allow one to construct all kinds of "files" with special behaviour without sticking them into the kernel. Such device drivers might be terminal managers providing advanced terminal services (or even primitive ones, like stopping when the screen fills up).

These are just a few of the possible uses for this mechanism that have occurred to me after a little thought.

3.2 What happened to network transparency?

This scheme _does_ throw network transparency (that is, the independence of filenames and file locations) out the window, particularly in the early stages of development, where each host will have its own namespace manager. However,

1.
Building absolute pathnames of files into programs, so that the programs break when the files move, is an old problem, which will be with us even when a file moves from one name to another in a network-transparent system.

2. This scheme will allow programs to locate the right copy of a file when the program moves to another machine.

3. This scheme provides a mechanism for constructing a (dare I say it?) logical-name server, which would provide an easy-to-use alternative to absolute pathnames.

Namespace designators will allow us to gradually introduce more namespaces as servers for them are implemented. Also, many namespaces can coexist, allowing earlier implementations to remain useful after more sophisticated systems are brought up. This flexibility will also come in handy when introducing other machines into the network, since they may provide different levels of remote access or distributed filesystem services. For that matter, if we ultimately go to a Locus-like filesystem, we may wish to have our Athena network divided into several Locus systems, and this mechanism will allow us to share files among those systems.

The implementation described here, in which each machine has its own namespace server, does not have to be the use which we emphasize when the system is released. If possible, we should provide logical-name servers so that users do not have to build file locations into their programs, but may rely on a logical-name server to provide the location of a given file at run-time (similar to putting the filename into the environment).

3.3 Distinction between namespace-name servers and namespace managers

Please note that a namespace-name server and a namespace manager are two different things. A namespace-name server takes a namespace name as a key and tells you how to contact the namespace manager.
A namespace manager takes a filename as a key and tells you how to contact the file (at the very least) and also probably manages (or starts the manager for) the operations on the file.

4 A filesystem is a database

Another way to view a filesystem server is as a database manager. A UNIX filesystem is basically a structured database with one key: the filename. It is easy to imagine a database manager which uses other keys, and which implements the UNIX file operations on its records. [8] Thus, a (namespace designator, filename) pair can be viewed as a (database name, key) pair.

[8] Once these file operations are defined.

Realizing that a filesystem is a database is a handy insight, and allows us to design a staged implementation plan for a filesystem like the one described in TM#1:

- _Remote access_: Use the existing UNIX filesystem as a database manager. This scheme has two stages:

  1. Producing a library of system-call emulators like those used in the Newcastle Connection scheme, which implement a remote-procedure-call mechanism communicating with a server. The server just performs the system call functions on the remote machine for the client. At this step we will have the namespace designator specify the host on which the file resides, e.g., "/@mit-pygmalion/usr/dm/dfs/doc/plan.mss". This can be viewed as a simple remote-access mechanism, and the user library will probably only be used by a few commands to debug the protocols.

  2. Once the server protocols for the simple remote-access mechanism are debugged, move the system-call implementation into the kernel. These protocols will remain useful in subsequent developments. At this stage, the remote-access mechanism will be usable by the public.

- _Simple distributed filegroups_: Implement a simple distributed-filegroup server, that is, a server which manages a filegroup which uses storage on multiple machines. This server does not have to be distributed itself, it need not implement replication, it need not handle network partitions particularly gracefully, nor need the server be able to migrate to another machine should the machine on which it is running crash. In short, this distributed-filegroup server is much simpler than the one which will become necessary to implement the functionality described in TM#1.

- _Distribute the filegroup server_: Add crash recovery to the distributed database server. That is, should the host on which the distributed-database manager resides crash, another incarnation of the manager needs to pop up. This step may have two sub-steps:

  1. When the database server crashes, existing open file connections are disrupted (that is, recovery from database-manager host crashes is not invisible to people using files in the filesystem).

  2. Make database-manager recovery invisible to users. [9]

- _Add replicated files_. At this point, a number of system calls may be added which only work on those database/filesystem managers which implement replicated files. This shouldn't cause much of a problem; after all, there are some system calls (e.g., seek) which work on one kind of file, but not on another.

[9] Some handwaving ensues to get you to ignore the fact that _both_ ends of the connection have to know how to recover from database server crashes for such crashes to be invisible. Later work will have to define a server recovery protocol for use in this case; the namespace-name server will, in turn, tell namei when a namespace manager uses the recovery protocol.

Putting the intelligence necessary to locate files into a namespace manager allows us to hide the details of how such location happens from client kernels and client programs. This will allow us to implement file location many times until we get it right. An additional note: once the protocols are defined, it will allow UNIX files to be accessible to programs on other machines running other operating systems.

5 Authenticating file accesses

While within the Athena system we can confidently say that individuals will have unique UIDs throughout the network, we cannot say that this will be true throughout the entire set of machines which partake in the Athena remote access system. Also, other institutions which adopt our system may not be able to assume this.

If you cannot trust a machine's kernel then you cannot trust requests that come from that machine without sealing each request packet with a digital signature. Current algorithms for producing reasonably secure digital signatures are too slow without special encryption hardware. In a network the size of Athena, secure key distribution will be a very difficult problem, and the hardware for public-key systems does not yet exist (nor are public-key systems blessed as being truly secure). I think the costs of a digital-signature approach to sealing each packet are prohibitive; therefore, I think we should build a system which requires being able to trust the operating system on a machine.

Each remote access request packet will contain the UNIX UID of the user on the client machine. The server will do what it wants with this information. Locally, on Athena systems, the server will just use this UID as the UID for local operations. Other systems may perform other translations based on this data. Potential strategies are:

1. I don't trust that host at all. Permission denied.

2. I only trust that host when it claims to be user #23 (which I map to my own user #42).

3. I map all users from that host to my local user, ITS-Guest.

Another thing to be concerned about is the size of a UID. UNIX UIDs are only 16 bits long.
Twenex UIDs are 36 bits long (as are, I suspect, ITS UIDs), and Multics UIDs are 32 _characters_ long. We could decide to use a 32-bit field (since Project Athena is likely to go to 32-bit UIDs in the near future) and argue that machines which don't use 32-bit UIDs are going to be coupled loosely enough to the Athena systems that they can perform their own mapping of their local UIDs to 32-bit Athena network IDs. While the Athena machines are tightly coupled (so that the work of noticing that files are off the local machine will be done by the operating system), it seems likely that Twenex and Multics machines will use special user programs to access the UNIX filesystems via this remote access protocol, especially since the protocols are so closely modelled on the relatively limited and primitive UNIX file operations.

6 The namespace protocol

Two different kinds of interchanges take place in locating a file and establishing a connection to the file. The requests to the namespace-name server and to the namespace manager are small, and require only a single message of request and a single message of reply. If a reply is not forthcoming, the request can be repeated. The exchanges between a user program and an open file, however, may require multiple packets which must be reliably delivered, and must be delivered in order.

I think this division indicates the underlying protocols to be used for the exchanges: requests to the namespace-name server and replies from the namespace-name server are UDP datagrams. The initial pathname request to the namespace manager, and the manager's reply, are also carried by UDP datagrams. A file is opened by opening a full-duplex TCP connection to the agent (the channel structure is a specification for the TCP connection), and descriptor-oriented requests and their replies travel through the TCP connection.
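The single-request, single-reply exchanges tolerate a very simple client retry discipline. The following sketch shows one way a client might carry out such a datagram exchange, repeating the request when no reply is forthcoming. The routine name rpc_datagram and its retry constants are illustrative assumptions, not part of any defined Athena protocol:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

enum { TRIES = 3, TIMEOUT_SEC = 1 };    /* illustrative values */

/* Send one request datagram and wait for one reply datagram,
 * retrying a bounded number of times.  Returns the reply length,
 * or -1 if no reply ever arrived. */
ssize_t rpc_datagram(int sock, const struct sockaddr_in *srv,
                     const void *req, size_t reqlen,
                     void *reply, size_t replylen)
{
    for (int i = 0; i < TRIES; i++) {
        sendto(sock, req, reqlen, 0,
               (const struct sockaddr *)srv, sizeof *srv);
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(sock, &fds);
        struct timeval tv = { TIMEOUT_SEC, 0 };
        if (select(sock + 1, &fds, NULL, NULL, &tv) > 0)
            return recv(sock, reply, replylen, 0);
        /* no reply yet: repeat the request; the sequence-number
         * rules of section 6.0.1 let the server discard the
         * duplicate if both copies eventually arrive */
    }
    return -1;
}
```

The descriptor-oriented traffic, by contrast, would simply read and write an established TCP connection and needs no retry logic of its own.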
Thus, pathname-oriented system calls happen by means of UDP datagrams; descriptor-oriented system calls happen by means of a TCP connection. In the context of remote access, a descriptor and a TCP connection may be considered synonymous.

Recall that when looking for an object, the namespace manager may return a descriptor for accessing the file, or it may return another file name to use in continuing the search for the object. As an optimization for those cases where the namespace manager would return a remote access descriptor which would then be used to perform an operation (e.g., access, chmod), we piggyback the operation request on top of the initial request to locate the object, so only one message to a namespace manager is necessary. In that case the namespace manager returns a positive result or a negative result. In fact, open is the only system call which does not follow this pattern, and is thus the only system call which may actually return a remote access descriptor. Another way of saying this is that open is the only system call which turns a pathname into a descriptor.

6.0.1 Datagram uniqueness and ordering

It is important that datagram-based requests arrive in order (imagine the consequences of an unlink requested before a create arriving after the create is performed) and that repeated packets be ignored. To allow the proper sequencing of requests, the user sends a request, then waits for a reply to the request before sending another request. To allow requests repeated because a reply was not forthcoming to be discarded, each request packet contains a process-ID and a sequence number.
The only thing that is guaranteed about the sequence number is that during any particular five-minute period the sequence number for a given process-ID increases monotonically (modulo 2^32), not that it increases by increments of one. This allows the user program to use a single sequence number for all the datagram requests it sends to all the hosts with which it corresponds. After five minutes the server may forget the last sequence number it received from a particular process-ID. This is probably sufficient to allow a host to crash and restart and allow new processes (coincidentally using process-IDs which were in use before the crash) to communicate with old servers who might otherwise remember the old process-ID's sequence numbers. Therefore, no provision need be made to remember process-IDs or sequence numbers across crashes--the sequence number knowledge will time out and be forgotten. [10]

[10] There is a slight problem with this scheme: it assumes processes whose active lives are short relative to the amount of time a host is up. Many daemons are started as soon as the host comes up and are active throughout a host's lifetime. In UNIX, since daemon processes are all started up in the same order by /etc/rc every time, they also always come up with the same UNIX process-ID. Therefore using UNIX process IDs as process-IDs will have some problems (which I think are solvable, but the means of solving this problem which come to mind are too ugly for me to want to describe them here).

6.1 Data structures and packet formats

6.1.1 General request packet format

A request from a user to an object manager (or to the namespace-name server) is a packet encapsulated in a UDP datagram (or may be transmitted across a TCP connection to the server). It consists of a header and data. The packet format is:

    struct request_packet {
        char protocol_version;
        char reserved;
        short request_type;
        long uid;
        long nonce;
        long hint;
        long process_id;
        long sequence_number;
        union {
            struct {
                short name_length;
                char name[ARBITRARY_LENGTH];
            } namespace_name;
            struct argstruct request_arguments;
        } requ;
    };

The structure elements are:

- protocol_version -- The protocol version number is included to allow us to change the protocol in the future, and to detect attempts to use older versions (if not to actually remain compatible with older versions).

- request_type -- The request type is a constant defining which system call this packet represents, or whether the packet is a request to a namespace manager.

- uid -- A 32-bit quantity which identifies the user on whose behalf the request is being made (see the sections on authentication).

- nonce -- A nonce is a number which is interpreted only by the user machine, and is ignored by the server. The server simply returns the nonce unchanged in its reply packet.

- hint -- The hint is a number interpreted only by the server end. It may be 0 (and will be in all pathname-oriented requests), or it is the number which the namespace manager returned in the rpc_channel reply when returning a descriptor for the file. The user end of the exchange puts no interpretation on the hint, and simply passes it back to the server. [11]

- process_id and sequence_number -- These elements provide a means of detecting and discarding duplicate packets. For a description of the mechanism used, see the section "Datagram uniqueness and ordering".

- request_arguments -- The arguments to the system call. Their format varies from request type to request type, and is described below.

- namespace_name -- The namespace name is passed as a string length followed by a string of bytes (in VAX, PDP11, and NS16032 ordering of bytes in words).

[11] I owe the idea of the nonce and the hint to Mike Greenwald's description of the MIT-LCS Remote Virtual Disk protocol (Greenwald, personal communication).

6.1.2 General reply packet format

The server's reply to a user's request may also be encapsulated in a datagram or shipped across a TCP connection. The packet has the following structure:

    struct reply_packet {
        char protocol_version;
        char reserved;
        short request_type;
        long nonce;
        long hint;
        long process_id;
        long sequence_number;
        struct rpc_reply rpc_reply;
    };

The elements of this structure are:

- protocol_version -- Means the same thing as in the request header.

- request_type, nonce, process_id, sequence_number -- All copied from the header of the request to which this is a reply.

- hint -- The user end will save this hint for use in future requests.
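The duplicate-detection rule described under "Datagram uniqueness and ordering" can be sketched as follows. This is a toy illustration under stated assumptions: a 32-bit sequence number compared modulo 2^32, a five-minute timeout, and a hypothetical table indexed by process-ID (a real server would key on the host as well):

```c
#include <assert.h>
#include <stdint.h>

#define SEQ_TABLE_SIZE 256
#define SEQ_TTL 300            /* seconds before an entry is forgotten */

struct seq_entry {
    long process_id;
    uint32_t last_seq;
    long last_time;
    int in_use;
};

static struct seq_entry seq_tab[SEQ_TABLE_SIZE];

/* modulo-2^32 "ahead of" test: treats up to half the sequence space
 * as newer, so wraparound is handled */
static int seq_newer(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Returns 1 if the request should be processed, 0 if it is a
 * duplicate or out-of-order packet to be discarded. */
int accept_request(long pid, uint32_t seq, long now)
{
    struct seq_entry *e = &seq_tab[(unsigned long)pid % SEQ_TABLE_SIZE];
    if (e->in_use && e->process_id == pid
        && now - e->last_time < SEQ_TTL
        && !seq_newer(seq, e->last_seq))
        return 0;               /* duplicate or stale: discard */
    e->in_use = 1;              /* entries past the TTL are overwritten */
    e->process_id = pid;
    e->last_seq = seq;
    e->last_time = now;
    return 1;
}
```

Note that the check only requires the new sequence number to be ahead of the last one seen, never exactly one greater, matching the "monotonic, not incremented by one" guarantee above.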
6.1.3 The structure a pathname remote procedure call returns

The prpc routine, which implements the PATHNAME object, and the drpc routine, which implements the DESCRIPTOR object, return an rpc_reply structure embedded in a reply_packet structure. The rpc_reply structure looks like this:

    struct rpc_reply {
        char reply_type;
        char reserved;
        short length_of_data;
        union {
            short result;
            long long_result;
            short errno;
            char string[ARBITRARY_LENGTH];
            struct rpc_channel channel;
        } data;
    };

This is composed of the following parts:

- reply_type is one of RPC_CHANNEL, PATHNAME, DATA, POSITIVE_RESULT, or NEGATIVE_RESULT, and is used to tell the calling routine how to interpret the union. [drpc will never return RPC_CHANNEL or PATHNAME, since they are only involved in locating the file given a pathname.]

- length_of_data is the length of the structure contained in the following union. For example, if the type of the reply is PATHNAME, then length_of_data is the length of the pathname in bytes; if the type of the reply is RPC_CHANNEL, then length_of_data is sizeof(rpc_channel).

- result is the integer result that the system call returns (most system calls return integers of one sort or another).

- long_result is provided for those system calls (lseek) which return a long rather than an int, and is not strictly necessary in a 32-bit implementation of UNIX. [12]

- errno is the error number to stick into the global variable errno. This interpretation is used only when the type of the reply is NEGATIVE_RESULT.
- string is used for PATHNAME returns, and can be any length, up to
  the maximum length of a pathname, which is currently 1024
  characters.  string is also used to provide a buffer in which to
  return the results of any system call (e.g., read, stat) which
  returns a buffer full of data.

- channel is a channel description used to access the file
  (currently, open is the only system call that returns one of
  these).  The rpc_channel structure is described elsewhere.

_______________
12. As a former PDP-11 UNIX programmer I view all the hidden "an int
is 32 bits long" assumptions that appear in the 4.2BSD sources with
extreme distaste.  These are _bad_ programming practices, and the
people at Berkeley should know better, especially since they had to
_remove_ things which made the assumptions explicit that were present
in the original Version 7 sources.  Shame, shame, shame.

6.1.4 Subroutine arguments

Each argument is passed in the argument buffer as a byte-stream
containing the data that forms the argument, preceded by a 16-bit
length-of-data field.  Argument lengths always begin on an even-byte
boundary.  For example, the arguments "Chris" and "Joe" are passed
as:

    05 "Ch" "ri" "s"-  03 "Jo" "e"-

    0x5 0x4368 0x7269 0x7300 0x3 0x4a6f 0x6500

(this is VAX, PDP11, and NS16032 order).  A number of unions make it
easier to access most system calls from C routines (although note
that the order of the arguments here is _not_ the same as if these
were real remote procedure calls).
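The length-prefix-and-pad rule above can be sketched as a small
marshalling routine.  This is only an illustrative sketch, not the
memo's actual code: the function name marshal_arg and the flat
byte-buffer interface are assumptions.  It writes the 16-bit length
low byte first (VAX/PDP11 order), copies the argument bytes, and pads
to an even-byte boundary.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch (hypothetical name) of the marshalling rule
 * described above: a 16-bit length-of-data field in VAX/PDP11 byte
 * order, the bytes of the argument, then a pad byte if needed so the
 * next argument's length begins on an even-byte boundary.
 * Returns the number of bytes consumed in buf. */
size_t marshal_arg(unsigned char *buf, const char *arg)
{
    size_t len = strlen(arg);
    size_t used = 0;

    /* the 16-bit length, low byte first (VAX order) */
    buf[used++] = (unsigned char)(len & 0xff);
    buf[used++] = (unsigned char)((len >> 8) & 0xff);

    /* the bytes of the argument itself */
    memcpy(buf + used, arg, len);
    used += len;

    /* pad so the next length field starts on an even byte */
    if (used & 1)
        buf[used++] = 0;

    return used;
}
```

Marshalling "Chris" and then "Joe" this way reproduces the 8-byte and
6-byte sequences shown in the example above.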
    struct argstruct {
        union {
            struct mode_struct {
                short   mode_len;
                long    mode;
                short   name_len;
                char    name[filename_len];
            } modes;
            struct open_struct {
                short   flag_len;
                short   flags;
                short   mode_len;
                short   mode;
                short   name_len;
                char    name[filename_len];
            } open;
            struct name_struct {
                short   name_len;
                char    name[filename_len];
            } name;
        } argu;
    };

Each of the structures defined in the argu union is an appropriate
form for the arguments of a different system call to take.  For an
example of the details of one of these structures, see the definition
of the prpc routine.

6.1.5 Remote access descriptors

Analogous to file descriptors, each process has a table of remote
access descriptors.  A remote access descriptor is a small integer
(it fits the same description as a file descriptor) which serves as
an index into a table of remote access file structures:

    struct ral_file {
        char    type;
        char    reserved;
        short   refct;
        short   real_fd;
        struct rpc_channel channel;
    } *ralfd_tab[NOFILES];

type is one of LOCAL or REMOTE.  refct is a reference count (since
multiple virtual file descriptors can point to a single remote access
descriptor, by grace of dup and dup2).  There are a couple of
routines provided for allocating and freeing remote access
descriptors.
    /* find an unallocated ralfd_tab entry */
    ral_file_alloc()
    {
        register int i;

        for(i = 0; i < NOFILES; i++)
            if(ralfd_tab[i] == NULL) {
                if((ralfd_tab[i] =
                    malloc(sizeof(struct ral_file))) == NULL)
                        return(-1);
                ralfd_tab[i]->refct = 0;
                return(i);
            }
        /* too many files are already open */
        return(-1);
    }

    /* free an ralfd_tab entry */
    ral_file_free(fd)
    {
        if(ralfd_tab[fd]->refct-- <= 0) {
            free(ralfd_tab[fd]);
            ralfd_tab[fd] = NULL;
        }
    }

6.1.6 The remote access channel structure

The namespace manager returns one of an error, another pathname to
use, or a channel for communicating with the file.  The channel is a
structure which looks like:

    struct rpc_channel {
        long    ip_addr;
        short   port;
        short   hint;
    };

The ip_addr field is the IP address of the host on which the manager
for this particular object resides.  The port is the UDP port to
which to send requests for operations, and the hint is a 16-bit
field which must be included with all operation requests, and is
interpreted solely by the object manager.

6.2 Remote procedure calls

6.2.1 The pathname remote procedure call routine

The prpc routine takes a varying number of arguments, depending on
which system call is being invoked remotely, and packages a remote
procedure call request for the remote system.  If the namespace
manager returns a new pathname to use, prpc checks the name to see
if it is local; if the new pathname _is_ local, then prpc returns
the new pathname to the calling routine, otherwise it locates the
new namespace manager and invokes the operation on the new pathname.
The prpc routine looks something like this:

    struct rpc_reply *
    prpc(system_call, arg1, arg2, arg3, arg4, arg5, arg6, arg7)
    {
        struct rpc_request request;

        /*
         * Construct a request packet for this call.
         * First, the invariant part.
         */
        request.protocol_version = PROTOCOL_VERSION;
        request.request_type = system_call;
        request.nonce = make_nonce();
        request.uid = getuid();
        request.hint = 0;
        request.process_id = ral_pid;
        request.sequence_number = make_sequence_number();

        /*
         * Now the variable part.
         */
        switch(system_call) {
        case ACCESS:
        case CHMOD:
        case CHOWN:
        case MKDIR:
            return prpc_path_mode(&request, arg1, arg2);
        case LINK:
        case SYMLINK:
            return prpc_opath_npath(&request, arg1, arg2);
        case OPEN:
            return prpc_open(&request, arg1, arg2, arg3);
        case RMDIR:
        case UNLINK:
            return prpc_path(&request, arg1);
        case READLINK:
            return prpc_path_buf_buflen(&request, arg1, arg2, arg3);
        case STAT:
            return prpc_path_buf_buflen(&request, arg1, arg2,
                sizeof(struct statbuf));
        }
    }

Here is a sketch of a possible implementation of the routine which
implements access, chmod, chown, and mkdir.

    struct rpc_reply *
    prpc_path_mode(request, pathname, mode)
    {
        char *filename;
        char *namespace_name;
        struct rpc_reply *namespace_reply;

        request->args.mode.mode_len = sizeof(long);
        request->args.mode.mode = mode;

    path_and_mode_name_loop:
        namespace_name = get_namespace_name(pathname);
        filename = get_filename((char *) pathname);

        namespace_reply = namespace_nameserver(namespace_name);
        switch(namespace_reply->type) {
        case NEGATIVE_REPLY:
            return an error indicating a non-existent file;
        case IPADDR:
            break;
        default:
            return an indication of an illegal reply
                from the namespace server;
        }

        /*
         * Now we know how to get in touch with the namespace
         * manager.
         */
        channel = namespace_reply->data.channel;
        request->args.mode.name_length = strlen(filename);
        copy the string at filename into request->args.mode.name;
            (this probably involves allocating a buffer for it);

        reply = send_to_channel(&channel, request);
        if(reply->type == PATHNAME) {
            if(is_local_filename(reply->data.string))
                return(reply);
            else {
                namespace_name =
                    get_namespace_name(reply->data.string);
                filename = get_filename(reply->data.string);
                goto path_and_mode_name_loop;
            }
        } else
            return(reply);
    }

The prpc routine returns a pointer to an allocated rpc_reply
structure.  The calling routine must explicitly garbage collect the
structure when it is done with it.

6.2.2 Descriptor remote procedure calls

The DESCRIPTOR object is implemented by the drpc (descriptor remote
procedure call) routine, which is largely analogous to the prpc
routine.  The drpc routine also returns a pointer to an rpc_reply
structure, and it looks like this:

    struct rpc_reply *
    drpc(system_call, descriptor, arg1, arg2, arg3, arg4, arg5)
    {
        struct rpc_request request;

        /*
         * Set up the request packet for this call.
         */
        request.protocol_version = PROTOCOL_VERSION;
        request.request_type = system_call;
        request.nonce = make_nonce();
        request.uid = getuid();
        request.hint = ralfd_tab[descriptor]->channel.hint;
        request.process_id = ral_pid;
        request.sequence_number = make_sequence_number();

        switch(system_call) {
        case FCHMOD:
        case FCHOWN:
        case FTRUNCATE:
            return drpc_long(&request, ralfd_tab[descriptor], arg1);
        case CLOSE:
        case FSYNC:
            return drpc_none(&request, ralfd_tab[descriptor]);
        case LSEEK:
            return drpc_long_long(&request, ralfd_tab[descriptor],
                arg1, arg2);
        case READ:
            return drpc_read(&request, ralfd_tab[descriptor],
                arg1, arg2);
        case WRITE:
            return drpc_write(&request, ralfd_tab[descriptor],
                arg1, arg2);
        case FSTAT:
            return drpc_fstat(&request, ralfd_tab[descriptor], arg1);
        }
    }

Note that the drpc routine will never return a PATHNAME, since the
location of the object has already been determined.

Since none of the descriptor-based routines need worry about the
complexity of hunting down the file given its pathname, the routines
making descriptor-based requests are much simpler.  All such routines
need do is set up the request packet with the proper format of
arguments, and send them along:

    struct rpc_reply *
    drpc_long(request, ralfd, arg)
    struct rpc_request *request;
    struct ral_file *ralfd;
    long arg;
    {
        request->args.mode.mode_len = sizeof(long);
        request->args.mode.mode = arg;
        return send_to_channel(&ralfd->channel, request);
    }

7 Step one: the MIT remote access library

The first step will be to provide remote access in a manner similar
to the Newcastle Connection or Harvard's RFS.  That is, a library of
system call emulators which implement the remote access protocols on
those file descriptors that designate "remote files".
These "system calls" will actually be user-mode implementations, as
was done with the first Newcastle Connection.  Accompanying the user
library will be a user-mode remote-access server, which handles
system-call requests from the library routines.  The server can
begin as a simple UNIX filesystem manipulator on a remote host, and
can also be viewed as implementing a sort of remote procedure call
service (the procedure calls are limited to UNIX system calls,
however).

An examination of section 2 of _The UNIX Programmer's Manual_,
Volume 1, reveals that there are about 33 system calls which will
need to be heavily modified to simulate the remote access system,
and 19 more which need to be slightly modified because of the games
which must be played with file descriptors in a non-kernel
implementation.  Many of these calls (e.g., stat, fstat and lstat)
are actually different interfaces to the same system call, making
this conversion task somewhat less daunting.

The key to the implementation is that the library of system calls
maintains its own global data structure which describes each of the
file descriptors accessible to the process as being either a real
file descriptor (in which case the ordinary action of the system
call takes place) or a remote-access file descriptor[13] (in which
case the remote access protocols are used).  Such data structures
are analogous to those used by the Standard-I/O library.

Ultimately the library routines will be moved into the kernel, for
the following reasons:

1. _Performance_.  With the library scheme it is likely that every
   read and write operation will involve copying the data at least
   once (although scatter-gather I/O may reduce the copying that is
   necessary).

2. _Authentication_.
   As discussed elsewhere in this paper, it is easier to trust the
   kernel on another machine than it is to provide a mechanism for
   trusting a random packet that comes to you from a user process on
   the other machine.

3. _Adaptability_.  Without dynamic linking (which is another way of
   providing adaptability), fixing bugs or adding features to
   routines in a library requires recompiling every program which
   uses the routines.  Fixing bugs in the kernel _only_ requires
   recompiling the kernel and rebooting the machine.  There will
   inevitably be problems that are revealed only when the programs
   "go public"; fixing these problems in a library will be a
   phenomenal headache (particularly since we won't have control
   over all the software which uses the library).

_______________
13. Note: this data structure has to be inheritable by child
processes so that the library routines used by children _also_ know
which file descriptors are remote-access descriptors.  This is one
of the kludges made necessary by user-mode implementations.

7.1 A remote access scenario

To clarify some of the notions described here, I'll present an
example of how a remote access may take place.  This example
describes an initial implementation of the remote access system,
rather than something more sophisticated.  I think this example will
remain relatively accurate as far as the client end is concerned,
but the implementation of the server end (with a server and a
separate process which serves as an agent for transactions on
standard UNIX files) will no doubt be different for different object
managers.

A process (compiled with the remote access library) initiates remote
access of a file by giving the open library routine a
distinguishable filename.
If the pathname passed to the open call specifies a remote file,
then the remote file protocol is engaged; otherwise the open call is
passed to the local operating system.  The user's C program says,
essentially:

    fd = open("/@namespace/filename", openflags, modes);
    if(fd < 0)
        error();

Which results in:

1. The open subroutine passes the filename to the namei routine.
   namei scans the name of the file.  Since the filename begins with
   a slash ('/') followed by an at-sign ('@'), the characters up to
   the next slash (or the characters up to the null which terminates
   the string, if there is no slash in the string after the '@') are
   taken to be the name of a namespace.

2. namei passes the namespace name to a namespace-name server.[14]
   The namespace-name server returns a message indicating how to get
   in touch with the namespace manager (or possibly an error).

3. The text in the filename after the slash which terminated the
   namespace name, up to the string-terminating null, is passed to
   the namespace manager, which scans the name, and sends a reply
   packet, which is one of three things:

   a. An error return.

   b. A capability to access the file (i.e., some descriptor which
      the client uses to refer to this particular file when talking
      to the server).  This capability may take the form of a socket
      specification on this or another host (the format of the
      socket specification can be the same as that returned by the
      namespace-name server).

   c. Another filename to use instead, causing namei to begin its
      file search from the beginning.[15]

   In other words, the namespace manager returns either an error or
   an address of the file.  The address may be either[16] in the
   internet domain or in the UNIX domain.

   The namespace manager, when it returns a file capability (which
   includes a host address and port number to connect to), arranges
   for someone to be awaiting a connection on that capability.
   Initially the process waiting for the connection will be a
   separate process which will serve as the user's _agent_ for
   transactions on this file.  This agent can change to the user's
   UID and the user's group membership to allow the regular UNIX
   protection mechanisms to work.

4. Having received the capability from the server, namei returns to
   the open, which updates the remote-access library's shadow
   file-descriptor map and stores the capability for use by the
   other filesystem commands.

5. The open routine returns a shadow file-descriptor for this
   remote-access file.[17]

In other words, for debugging purposes, we are moving a good deal of
namei (along with getf) into user mode.

Subsequently, the user program issues a system call, e.g., a write,
using a "virtual file descriptor".  The remote access library
performs the file-descriptor mapping (in the case of ordinary files)
or sends off a request to the agent for that file.  The agent,
receiving a request from the user program, can simply translate the
request into a real system call on its machine, and then translate
the results into a reply packet.

_______________
14. The first implementation of this will probably just be a
subroutine, which has the location of the appropriate nameservers
compiled in.

15. This is the feature which allows many different name services to
be provided, as well as implementing machine-spanning symbolic
links.  For example, the '~' feature of the C shell could be
generalized using this mechanism by special-casing '~' to be
synonymous with '/@~/', and constructing a '~' namespace manager.
One could also provide various other virtual names (e.g., "/@lpt")
to be accessible from all programs.

16. Incidentally, the namespace-name server _also_ returns either an
error or an address.  The protocols for these are likely to be the
same.
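The namespace scan in step 1 can be sketched in C.  This is a
hypothetical sketch, not the memo's namei: the function name
split_namespace and its caller-supplied buffers are my own.  Given a
pathname beginning "/@", it separates the namespace name from the
remainder of the path, handling the case where no slash follows the
'@'.

```c
#include <string.h>

/* Illustrative sketch of the namespace scan described in step 1.
 * If the pathname begins "/@", the characters up to the next slash
 * (or the terminating null) name a namespace.  Returns 1 and fills
 * in nsname and rest, or returns 0 for an ordinary local pathname.
 * The names here are hypothetical, not the memo's. */
int split_namespace(const char *path, char *nsname, char *rest)
{
    const char *p, *slash;

    if (path[0] != '/' || path[1] != '@')
        return 0;                  /* not a namespace pathname */

    p = path + 2;                  /* skip the leading "/@" */
    slash = strchr(p, '/');
    if (slash == NULL) {           /* no slash after the '@' */
        strcpy(nsname, p);
        rest[0] = '\0';
    } else {
        memcpy(nsname, p, slash - p);
        nsname[slash - p] = '\0';
        strcpy(rest, slash + 1);   /* text after the slash goes to
                                    * the namespace manager */
    }
    return 1;
}
```

A pathname like "/@lpt" (one of the virtual names suggested in the
footnote) yields the namespace "lpt" and an empty remainder.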
Servers which manage objects which aren't UNIX files will, of
course, be implemented differently.

7.2 Authenticating the remote access library

I don't think the remote-access protocol can be authenticated
without changes to the kernel (barring use of digital signatures,
which I believe to be beyond the scope of a simple implementation).
One can choose to ignore the authentication problems until the
client end of the remote access library is moved into the kernel
(meaning that the user-mode implementation can be used solely for
debugging purposes).

An alternate approach would be to stick a simple authentication
protocol into the kernel (this protocol will accept datagrams from
the user and wrap some kind of authenticating information, such as
the UID on this system, around the datagram, analogous to the way a
TCP datagram is wrapped in an IP datagram).  Just as the rlogin
protocol uses restricted IP ports (the ports are restricted to use
by the root), the remote access protocol can also use restricted IP
ports (the restriction here is that ordinary users can use these
ports only if they employ the authentication protocol).  Since the
authentication information is stuck onto the datagram in an
unforgeable way by the kernel, again the datagrams are as
trustworthy as the kernel on the remote host.

_______________
17. The remote-access library maintains complete control over the
actual file-descriptors the user program uses, mapping the
file-descriptors the user sees to real file-descriptors or
remote-access capabilities.  The user can't simply use the
file-descriptor returned from a real open system call, as that
descriptor could already be in use as a remote file.
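The wrapping just described might look like the following sketch.
The structure layout and the names auth_header and auth_wrap are
illustrative assumptions, not part of the memo's protocol; the point
is only that the kernel prepends the sender's UID to the user's
datagram the way an IP header is prepended to a TCP segment, so the
receiver trusts the UID exactly as far as it trusts the sending
kernel.

```c
#include <string.h>

/* Hypothetical sketch of a kernel-applied authentication wrapper.
 * The kernel, not the user, fills in sender_uid before the datagram
 * leaves the machine, so the header is unforgeable by user
 * processes on that host. */
struct auth_header {
    long  sender_uid;       /* inserted by the kernel */
    short payload_length;   /* length of the user's datagram */
};

/* Prepend an auth_header to a user datagram in out; returns the
 * total wrapped length. */
int auth_wrap(char *out, long uid, const char *datagram, short len)
{
    struct auth_header h;

    h.sender_uid = uid;
    h.payload_length = len;
    memcpy(out, &h, sizeof h);            /* the wrapper */
    memcpy(out + sizeof h, datagram, len);/* the wrapped datagram */
    return sizeof h + len;
}
```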
Either moving the user routines into the kernel or implementing an
authenticated datagram protocol will make the remote access protocol
as trustworthy as the kernel on the client machine.[18]  I intend to
just punt the issue of _unforgeably_ authenticating the user-mode
implementation, and will only let real people use the system once
the user end of the system is moved into the kernel.

7.3 An overview of UNIX filesystem-oriented system calls

UNIX filesystem-oriented system calls have these kinds of arguments:

- DESCRIPTOR (int) -- a file descriptor.  The remote access library
  will have to map virtual file-descriptors, seen by the user
  program, either to real file-descriptors or to remote access
  channels.  Every system call which takes a file-descriptor as an
  argument or returns a file-descriptor as a result will have to be
  filtered through the remote access library's descriptor-mapping
  routines.

- PATHNAME (char *) -- a pathname.  All of the system calls which
  use a pathname as an argument will have to use the remote-access
  library's version of namei to locate the file.

- BUFFER_ADDRESS (char *) -- the address of a buffer in which to put
  results.

- BUFFER_LENGTH (int) -- the length of the buffer.

- BITMASK (int *) -- a new type of argument introduced in 4.2BSD;
  the bitmask is used by select to choose those file-descriptors the
  user wants select to act on.  It is an integer pointer because it
  is used as a place in which to return information: modifying the
  bitmasks is a side effect of the select system call.

_______________
18. In the Athena environment, this may not be very trustworthy at
all, but we run that risk with the rlogin protocol, too.
  The select system call will have to be filtered through the
  remote-access library to have the bits in its bitmasks rearranged,
  as the user program uses them to denote virtual file-descriptors,
  and the operating system expects them to be real file-descriptors.
  This rearranging has to be done both in making the real select
  call and in rearranging the return values.

- ints in which various flags (as in the open system call) or modes
  (as in the chmod system call) or counts (as in the select system
  call) are passed.

- There are some specialized structures (such as the statbuf used by
  the stat system calls, and the struct msghdr used by the recvmsg
  and sendmsg system calls).  Treatment of these is analogous to the
  way that the buffer passed by the read system call is treated.

System calls return these kinds of results:

- RESULT -- whether the call succeeded or not (more or less), e.g.,
  close, access.

- AMOUNT -- "how much" of something happened, e.g., read, write.

- DESCRIPTOR (int) -- a file descriptor, e.g., open, accept.

In addition, system calls have various side-effects:

- errno: system calls have to provide for returning the value of
  errno; from C it looks like errno changes as a side-effect of
  failing system calls (succeeding system calls leave errno
  untouched).

- The read (and receive) family of system calls have the side-effect
  of sticking data into a buffer.

- The ioctl, fcntl, select and stat system calls all modify the
  contents of a buffer (or buffers), the address of which is passed
  to the system call.

Any "remote procedure call" model of system calls will have to
account for both the return values and the side-effects of system
calls.
7.4 Problematic system calls

There are a number of system calls which will present special
problems to the implementor of a remote access system.  Some of
these system calls may not have to be implemented at all, some of
them present puzzling semantics in a distributed environment, and
some of them I would just like to defer until later.

7.4.1 Mknod, mount, umount

    RESULT mknod( PATHNAME path, int mode, int dev)
    RESULT mount( PATHNAME special, PATHNAME name, int rwflag)
    RESULT umount( PATHNAME special)

The mount, umount, and mknod system calls are ones which seem like
they will be more trouble to distribute than they are worth.  The
semantics of a remote mount, or a mount of a remote filesystem, are
unclear to me.  Use the rsh facility to execute a remote mount.

7.4.2 Fcntl, ioctl

    RESULT fcntl( DESCRIPTOR fd, int cmd, int arg)
    RESULT ioctl( DESCRIPTOR s, int request, char *structure)

fcntl and ioctl are problematic largely because there are so many
forms to them.[19]  At the Summer 1983 USENIX conference in Toronto
Mike Lesk described the problem with ioctl: "It used to be that
every manual page except for the shell was no longer than a single
page; now we give Master's degrees in ioctl."

The main problem with ioctl is that the area pointed to by the
structure argument can be of any size, and that any implementation
which passes arguments to a remote ioctl has to know about _all_ of
its many variations.

_______________
19. Or, more precisely, because they are so poorly designed.
A possibly better way to have implemented the system call in the
first place would have been to include an argument which told the
size of the structure being passed, so that ioctl could have some
means of checking that this variation of ioctl was being called with
the structure it expects; but error checking is a concept which is
relatively new in UNIX programs.

7.4.3 Remote execve

    RESULT execve( PATHNAME path, char *argv[], char *envp[])

Remote execve would be very hard to implement outside of the kernel,
because a process has many files in its environment, which must be
available to the process.  To implement remote execve would require
passing a translation of the remote access library's data structures
to the new process on the remote machine.  The translation is
complicated by the fact that local files will be turned into remote
files, and capabilities for remote files must be passed on to the
execveing system.  Yuck.  It would be nice to have, however, but
later.  Meanwhile, use rsh.

7.4.4 Signals

Once you've done remote execve, you'll have to be able to send
signals to the processes on foreign hosts, too.

7.4.5 Flock

    RESULT flock( DESCRIPTOR fd, int operation)

I intend to skip the flock system call at first, and implement it
later when there is more time.  I expect to implement it by having
the lock management done by the storage site, and by synchronizing
file operations on flocked file-descriptors more carefully.  There's
likely to be a performance penalty.

7.4.6 Rename

    RESULT rename( PATHNAME from, PATHNAME to)

rename is atomic.  Atomicity across network boundaries is hard.
7.5 System calls which require descriptor mapping

The remote access library will have to use some file descriptors to
communicate with servers (these descriptors will have to be
inaccessible to the user's program), and will also have to pretend
that some file descriptors are used to describe remote files (these
descriptors will have to be mapped to the values the user's program
expects).  Therefore, the remote access library will present to the
user's program a set of "virtual file descriptors" which the library
routines will have to map to real file descriptors before calling on
the system.

Therefore all system calls which use (or return) some representation
of a file descriptor will have to have an interface which passes
through the remote access library, in order to map the virtual
descriptors presented to the user program by the library to real
file descriptors used by the system, or to remote access descriptors
used by the library.

7.5.1 Socket-related system calls

I plan no changes to these system calls, which deal with sockets.
One might toy with the idea of defining an Athena domain, which is
separate from the Internet domain, but I see no point to it, at
least not now.

    accept          bind            connect
    getpeername     getsockname     getsockopt
    setsockopt      listen          pipe
    recv            recvfrom        recvmsg
    send            sendto          sendmsg
    shutdown        socket          socketpair

7.5.2 The select system call

The select system call also doesn't have to be modified to work in
the remote access model, as it's really used on those descriptors
which represent IPC channels: select on an ordinary file is
meaningless.  Like the socket system calls, however, the remote
access library has to intervene between the user and the real system
call in order to perform the descriptor mapping.
    AMOUNT select( int nfds, BITMASK readfds, BITMASK writefds,
                   BITMASK exceptfds, timeval timeout)

7.6 System calls to be implemented by the remote access library

The following system calls will be implemented in the remote-access
library.  In the discussion which follows, I present the type of
return for the system calls by showing the system calls "declared"
as type RESULT, DESCRIPTOR, or something else.  Similarly, the
arguments to the system call are "declared" according to their type.
Any side-effects of the system call are also mentioned (all system
calls alter the value of errno if they fail).

In the discussions below there are examples shown which are written
in Pidgin C or in Pidgin Algol.  These examples block out a general
implementation,[20] but do not represent any actual implementation.
In particular, the examples aren't guaranteed to be syntactically
correct or complete.  Fencepost errors abound.  Also, in the
examples below, real system calls (e.g., a write on a local file)
are shown with their name prefixed by an underscore (e.g., _write).

Some general principles of implementation shared by all system calls
may be stated:

- Any system call which takes a PATHNAME argument must use prpc
  (pathname remote procedure call) to locate the file and perform
  the operation.

- Any system call which takes a DESCRIPTOR argument or returns a
  DESCRIPTOR result must map the virtual file descriptor to the
  library's internal file descriptor.  System calls taking
  DESCRIPTOR arguments use drpc (descriptor remote procedure call)
  to perform the operation.

_______________
20. They especially do not represent how anything written at Bell
Labs or AT&T implements anything.  All the examples of system call
implementations are "reverse engineered" from the UNIX manual pages.
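The bitmask rearranging required for select might look like the
following sketch.  The map array real_fd_of and the function name
remap_mask are assumptions of mine, and the 4.2BSD bitmask is
treated here as a single int for brevity: each set bit, denoting a
virtual descriptor, is translated through the library's map to the
real descriptor before the real _select call (the return values are
remapped by the inverse operation).

```c
/* Illustrative sketch of select bitmask remapping.  NOFILES and the
 * identity of the map array are stand-ins for the library's real
 * descriptor table. */
#define NOFILES 20

int remap_mask(int virtual_mask, int real_fd_of[NOFILES])
{
    int vfd, real_mask = 0;

    /* translate each set bit from a virtual descriptor number to
     * the real descriptor number the system expects */
    for (vfd = 0; vfd < NOFILES; vfd++)
        if (virtual_mask & (1 << vfd))
            real_mask |= 1 << real_fd_of[vfd];
    return real_mask;
}
```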
7.6.1 Access

    RESULT access( PATHNAME pathname, int mode)

The access system call is a good example of a system call which
takes a PATHNAME argument.  The library routine will look like this:

    access(pathname, mode)
    {
        struct rpc_reply *reply = NULL;
        int retval;

        if(is_local_filename(pathname)) {
    name_loop:
            /* do the real access system call on
             * the local system.
             */
            retval = _access(pathname, mode);
            if(reply != NULL)
                garbage_collect(reply);
            return(retval);
        } else {
            register int result;

            if(reply != NULL)
                garbage_collect(reply);
            reply = prpc(ACCESS, pathname, mode);
            switch(reply->type) {
            case POSITIVE_RESULT:
                result = reply->data.result;
                garbage_collect(reply);
                return(result);
            case NEGATIVE_RESULT:
                errno = reply->data.errno;
                garbage_collect(reply);
                return(-1);
            case PATHNAME:
                pathname = reply->data.string;
                goto name_loop;
            default:
                ralerr("Bad prpc reply in access");
                return(-1);
            }
        }
    }

Note the use of the pathname optimization mentioned in the
description of prpc.  If prpc returns a pathname then the pathname
is guaranteed to refer to an object on the local machine.

All the code referring to the register variable result in this
example appears for pedagogical purposes only, as an example of how
to return integral results from a remote system call.  The
POSITIVE_RESULT code could actually read:

    case POSITIVE_RESULT:
        garbage_collect(reply);
        return(0);

since access indicates success by a return of 0.

7.6.2 Chmod, fchmod

    RESULT chmod( PATHNAME path, int mode)
    RESULT fchmod( DESCRIPTOR fd, int mode)

chmod is more or less identical to access, except that the code for
a local implementation of access is replaced with the code for a
local implementation of chmod.
fchmod is a good example of a system call which uses a descriptor, and will look like:

    fchmod(fd, mode)
    {
        mode = mask_mode_with_umask(mode);
        if(ralfd_tab[fd]->type == LOCAL_FILE) {
            /* do the real fchmod on the local file. */
            return(_fchmod(ralfd_tab[fd]->real_fd, mode));
        } else {
            struct rpc_reply *reply;

            reply = drpc(FCHMOD, fd, mode);
            switch(reply->type) {
            case POSITIVE_RESULT:
                garbage_collect(reply);
                return(0);
            case NEGATIVE_RESULT:
                errno = reply->errno;
                garbage_collect(reply);
                return(-1);
            default:
                ralerr("Bad drpc return in fchmod");
                return(-1);
            }
        }
    }

7.6.3 Chown, fchown

RESULT chown( PATHNAME path, int owner, int group)

RESULT fchown( DESCRIPTOR fd, int owner, int group)

7.6.4 Chdir, chroot

RESULT chdir( PATHNAME path)

RESULT chroot( PATHNAME path)

Chdir and chroot both may be viewed as means of telling namei or prpc to change the way they expand filenames. These system calls put implicit strings onto the beginning of every relative pathname (a relative pathname is one which does not begin with a '/'), and this is how I intend to implement them in the library. The implicit string will be able to contain a namespace designator as well as a filename. Some care has to be taken to expand the pathname argument to an absolute pathname if what the user provides is a relative pathname.

7.6.5 Close

RESULT close( DESCRIPTOR fd)

Because dup and dup2 allow multiple file descriptors to refer to the same remote access file, close has to depend on the reference count, and actually closes the file only when the reference count goes to 0.
    close(fd)
    {
        if(ralfd_tab[fd]->refct <= 1) {
            if(ralfd_tab[fd]->type == LOCAL) {
                register int real_fd = ralfd_tab[fd]->real_fd;

                ral_file_free(fd);
                return(_close(real_fd));
            } else {
                struct rpc_reply *reply;

                reply = drpc(CLOSE, fd);
                switch(reply->type) {
                case NEGATIVE_RESULT:
                    errno = reply->errno;
                    garbage_collect(reply);
                    return(-1);
                case POSITIVE_RESULT:
                    garbage_collect(reply);
                    ral_file_free(fd);
                    return(0);
                default:
                    ralerr("Bad drpc reply in close");
                    return(-1);
                }
            }
        } else {
            /* other virtual descriptors still refer to this
             * file; just drop this reference. */
            ralfd_tab[fd]->refct--;
            ralfd_tab[fd] = NULL;
            return(0);
        }
    }

7.6.6 Dup, dup2

DESCRIPTOR dup( DESCRIPTOR fd)

DESCRIPTOR dup2( DESCRIPTOR oldfd, DESCRIPTOR newfd)

dup is simply a means of manipulating the descriptor map, and can be implemented entirely in the library, without sending a message to the server.

    dup(oldd)
    {
        register int i;

        for(i = 0; i < NOFILE; i++)
            if(ralfd_tab[i] == NULL) {
                ralfd_tab[oldd]->refct++;
                ralfd_tab[i] = ralfd_tab[oldd];
                return(i);
            }
        errno = EMFILE;    /* too many open files */
        return(-1);
    }

dup2, however, is a combination of a dup and a close, and may be implemented that way.
    dup2(oldd, newd)
    {
        if(ralfd_tab[newd] != NULL)
            close(newd);
        ralfd_tab[newd] = ralfd_tab[oldd];
        ralfd_tab[newd]->refct++;
        return(newd);
    }

7.6.7 Fsync

RESULT fsync( DESCRIPTOR fd)

7.6.8 Link, symlink, readlink

RESULT link( PATHNAME path1, PATHNAME path2)

RESULT symlink( PATHNAME path1, PATHNAME path2)

RESULT readlink( PATHNAME path, BUFFER_ADDRESS buf, BUFFER_SIZE sizeof(buf))

7.6.9 Lseek

AMOUNT lseek( DESCRIPTOR fd, int offset, int whence)

7.6.10 Mkdir

RESULT mkdir( PATHNAME path, int mode)

7.6.11 Creat, open

RESULT creat( PATHNAME path, int mode)

DESCRIPTOR open( PATHNAME path, int flags, int mode)

The creat call from earlier UNIX systems has been subsumed by the open system call, which is how it will be implemented:

    creat(pathname, mode)
    {
        return(open(pathname, O_WRONLY | O_CREAT | O_TRUNC, mode));
    }

The open system call is more complex than the other system calls which take a pathname argument, so it will be instructive to sketch out an implementation here. As with all the other PATHNAME system calls, the operation request is piggybacked on top of the location request (this is all done within prpc).
    open(pathname, flags, mode)
    {
        int rafd;
        struct rpc_reply *reply = NULL;

        if((rafd = ral_file_alloc()) == -1)
            return(-1);
        if(is_local_filename(pathname)) {
    name_loop:
            if((ralfd_tab[rafd]->real_fd = _open(pathname, flags, mode)) < 0) {
                ral_file_free(rafd);
                if(reply != NULL)
                    garbage_collect(reply);
                return(-1);
            } else {
                ralfd_tab[rafd]->type = LOCAL;
                ralfd_tab[rafd]->refct++;
                if(reply != NULL)
                    garbage_collect(reply);
                return(rafd);
            }
        } else {
            reply = prpc(OPEN, pathname, flags, mode);
            switch(reply->type) {
            case POSITIVE_RESULT:
                ralfd_tab[rafd]->refct++;
                ralfd_tab[rafd]->type = REMOTE;
                ralfd_tab[rafd]->channel = reply->data.channel;
                garbage_collect(reply);
                return(rafd);
            case NEGATIVE_RESULT:
                errno = reply->errno;
                garbage_collect(reply);
                ral_file_free(rafd);
                return(-1);
            case PATHNAME:
                pathname = reply->data.string;
                goto name_loop;
            default:
                ralerr("Bad prpc reply in open");
                return(-1);
            }
        }
    }

7.6.12 Read, readv

AMOUNT read( DESCRIPTOR fd, BUFFER_ADDRESS buf, BUFFER_SIZE amt)

AMOUNT readv( DESCRIPTOR fd, struct iovec *iov, int iovcnt)

From the point of view of the library routines, requests to the server and replies may be asynchronous. From the point of view of the program which uses the remote-access library, the subroutine calls are synchronous, just as UNIX system calls are. We can employ the same optimization on reads that the UNIX filesystem driver uses, by following a read request with a request for the next block of the file, and buffering the reply.

Reading and writing will use the same buffering techniques as the Standard I/O library (that is, actual reads and writes happen in large increments). The read and readv library routines should implement the read-ahead discipline themselves.
readv can be implemented (at least initially) as several calls on read:

    readv(fd, iov, iovcnt)
    {
        register int i;
        register int retval = 0;

        for(i = 0; i < iovcnt; i++, iov++) {
            register int tmp;

            tmp = read(fd, iov->iov_base, iov->iov_len);
            switch(tmp) {
            case -1:
                if(retval == 0)
                    return(-1);
                else
                    return(retval);
            case 0:
                return(retval);
            default:
                retval += tmp;
            }
        }
        return(retval);
    }

read will perform read-ahead buffering, etc.

    read(fd, buf, len)
    {
        if(ralfd_tab[fd]->type == LOCAL)
            return(_read(ralfd_tab[fd]->real_fd, buf, len));
        else {
            while(there is still room in buf AND haven't read EOF from the file) {
                if(the input is already buffered) {
                    while(there is still room in buf AND the buffered data has not been exhausted) {
                        copy the buffered data into buf;
                        update pointers into the buffer;
                    }
                }
                if(there is (still) room in buf) {
                    send a read request for lots more than would fit in buf;
                    wait for a reply;
                }
            }
            return the amount copied into buf;
        }
    }

7.6.13 Rmdir

RESULT rmdir( PATHNAME path)

7.6.14 Truncate, ftruncate

RESULT truncate( PATHNAME path, int length)

RESULT ftruncate( DESCRIPTOR fd, int length)

7.6.15 Umask

int umask( int mask)

The umask system call is not passed to the server by the library. The mask is applied to the relevant arguments to other system calls (e.g., creat, chmod) by the library implementations of those calls before the arguments are sent off to the server. For an example of this kind of use, see the description of the fchmod routine.
7.6.16 Unlink

RESULT unlink( PATHNAME path)

7.6.17 Write, writev

AMOUNT write( DESCRIPTOR fd, BUFFER_ADDRESS buf, BUFFER_SIZE sizeof(*buf))

AMOUNT writev( DESCRIPTOR fd, struct iovec *iov, int iovlen)

As described in the section on read and readv, writev may be implemented as a series of calls to write.

7.6.18 Stat, lstat, fstat

RESULT stat( PATHNAME path, *statbuf)

RESULT lstat( PATHNAME path, *statbuf)

RESULT fstat( DESCRIPTOR fd, *statbuf)

7.7 Structure of an initial namespace manager

The initial namespace manager I plan to implement will manage the UNIX files on the machine on which the manager is running. This manager will appear in the namespace-name space under the name of the machine on which it runs.

The manager will consist of two halves: a half which listens on a well-known port for pathname-based requests, and a half which serves as a user's agent for descriptor-based requests.

The pathname half of the manager will listen for requests on a well-known UDP port (it probably should consult a namespace-name server to find out where it is supposed to listen). This half will perform the actions of the pathname-based system calls. It must perform all of the UNIX access checking itself, since it will no doubt be a privileged process. Fortunately, it should not be hard to move the UNIX kernel's access routine into the namespace manager in a usable form.21

The descriptor-based half of the namespace manager will be spawned by the pathname-based half in response to a request to open a file.
When it receives an open request, the pathname half will respond with a message describing what TCP port to open to talk to the requested file. The pathname half will fork the descriptor half, which will change its identity to that of the requesting user (and change its group memberships to those of the groups to which the user belongs) and which will listen for a connection on the correct TCP port. This process will serve as the user's agent in performing actions on the file, and it may use the normal UNIX file-protection mechanisms. This process need only be able to understand the descriptor-based system calls.

Both halves of the manager may be built as simple loops: listen for a request, decipher the request, perform the request, then send the reply back to the user.

_______________
21 This routine, in a form which takes a user-id as an argument, should probably be made available to the general public. Note that this version of the access routine will have to extract the user's group memberships from the /etc/group file. Another likely source for this code is the FTP daemon.

Table of Contents

1 Introduction                                                     1
2 Goals, and problems with TM#1                                    2
3 Implementing multiple namespaces                                 4
3.1 Possible advanced uses of namespace managers                   6
3.2 What happened to network transparency?                         7
3.3 Distinction between namespace-name servers and
    namespace managers                                             8
4 A filesystem is a database                                       9
5 Authenticating file accesses                                    11
6 The namespace protocol                                          12
6.0.1 Datagram uniqueness and ordering                            12
6.1 Data structures and packet formats                            13
6.1.1 General request packet format                               13
6.1.2 General reply packet format                                 15
6.1.3 The structure a pathname remote procedure call returns      15
6.1.4 Subroutine arguments                                        17
6.1.5 Remote access descriptors                                   18
6.1.6 The remote access channel structure                         18
6.2 Remote procedure calls                                        19
6.2.1 The pathname remote procedure call routine                  19
6.2.2 Descriptor remote procedure calls                           22
7 Step one: the MIT remote access library                         25
7.1 A remote access scenario                                      26
7.2 Authenticating the remote access library                      28
7.3 An overview of UNIX filesystem-oriented system calls          29
7.4 Problematic system calls                                      31
7.4.1 Mknod, mount, umount                                        31
7.4.2 Fcntl, ioctl                                                31
7.4.3 Remote execve                                               32
7.4.4 Signals                                                     32
7.4.5 Flock                                                       32
7.4.6 Rename                                                      32
7.5 System calls which require descriptor mapping                 33
7.5.1 Socket-related system calls                                 33
7.5.2 The select system call                                      34
7.6 System calls to be implemented by the remote access library   34
7.6.1 Access                                                      35
7.6.2 Chmod, fchmod                                               37
7.6.3 Chown, fchown                                               37
7.6.4 Chdir, chroot                                               38
7.6.5 Close                                                       38
7.6.6 Dup, dup2                                                   39
7.6.7 Fsync                                                       40
7.6.8 Link, symlink, readlink                                     40
7.6.9 Lseek                                                       40
7.6.10 Mkdir                                                      40
7.6.11 Creat, open                                                40
7.6.12 Read, readv                                                42
7.6.13 Rmdir                                                      44
7.6.14 Truncate, ftruncate                                        44
7.6.15 Umask                                                      44
7.6.16 Unlink                                                     45
7.6.17 Write, writev                                              45
7.6.18 Stat, lstat, fstat                                         45
7.7 Structure of an initial namespace manager                     45