To: mbarker@MIT.EDU
Cc: rbasch@MIT.EDU
Subject: batch systems
Date: Mon, 23 Jun 1997 18:29:47 EDT
From: Robert A Basch

Mike,

I have completed the first round of evaluating available batch systems; I decided to write up a summary, which I'm including here, as a way to get discussions going on the next step.

Bob

I have spent some time looking at what is available in the way of network batch systems, both commercial and public domain, for heterogeneous UNIX workstations. Since Kerberos-based authentication and AFS support are known to be important requirements for our environment, I focused first on whether any existing system provided such support.

For commercial systems, currently only Platform Computing's LSF seems to offer any plausible scheme for operating in a Kerberos/AFS environment. From their documentation, they seem to offer the following capabilities in this area:

- Scripts are provided which can be used to transfer tokens from the submission host to the execution host. These scripts can be modified to provide further security, e.g. encryption.

- AFS token renewal can be done via a supplied daemon for the AFS server. It seems to work by accepting a list of trusted hosts which are authorized to renew tokens. Note that it also reads the Authentication Database file, kaserver.DB0.

- Hooks are provided to do external authentication, on both the client and server sides. Kerberos-style sample programs are provided.

Questions: Is this "solution" worth looking at any further, i.e. in a formal evaluation? Or should this design be rejected out of hand for our environment? Will their scheme fail here due to local AFS modifications?

None of the available public domain batch systems seems to have any acceptable support for a Kerberos/AFS environment.
DQS (Distributed Queuing System) does have a feature whereby a job "shepherd" process will renew AFS tokens for the job, but this is accomplished simply by accepting and storing the potentially encrypted user password along with the request, and feeding it periodically to klog; this is presumably not an acceptable solution here.

I have downloaded, built, and done minimal testing on both DQS and NQS (the latter is the most widely used free UNIX queuing system). Both are heterogeneous, network-based systems, allow the configuration and enforcement of various resource allocations and limits, support copying of the user environment along with the job, and support queue "complexes", i.e. ways of grouping or tagging queues to provide flexibility.

DQS offers more features for a batch system, and also seems to have a somewhat better architecture: All submissions are to a master queuing server, which then dispatches jobs to execution servers based on their ability to meet the job requirements specified by the user. Thus, there is no need for the client system to have any information about the queue configuration; all such information is acquired from the master server. (There is a provision for a backup master host, to avoid having a single point of failure.) The only program needed on the client is the submit program. However, the client does need access to two configuration text files, one containing the name of the queue master host, the other containing various settings, such as the service port numbers to use.

NQS has a more decentralized approach, with no master queuing server; each client system contains a complete queue configuration, and forwards a submitted request directly to the execution server for that queue. Each client would thus need to run its own scheduling daemon, as well as being properly configured to know about available queues. (The queue configuration seems to be contained in non-text files that are managed by the scheduler.)
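To make the master-server model described above for DQS more concrete, here is a minimal sketch of how a central dispatcher might match a job's stated requirements against the execution servers it knows about. This is not DQS's actual algorithm or data model; the class names, the attributes (architecture, memory, slots), and the least-loaded tie-break are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecHost:
    """Hypothetical record the master keeps for one execution server."""
    name: str
    arch: str
    mem_mb: int
    slots: int = 1      # how many jobs the host may run at once
    running: int = 0    # how many jobs it is running now


@dataclass
class Job:
    """Hypothetical job request with user-specified requirements."""
    user: str
    arch: str
    mem_mb: int


def dispatch(job, hosts):
    """Pick the least-loaded host that meets the job's requirements.

    Returns the chosen host (and marks one slot busy), or None if no
    host currently qualifies, in which case the job would stay queued.
    """
    candidates = [
        h for h in hosts
        if h.arch == job.arch
        and h.mem_mb >= job.mem_mb
        and h.running < h.slots
    ]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda h: h.running / h.slots)
    chosen.running += 1
    return chosen
```

The point of the sketch is that only the master holds the host table; a client needs nothing beyond the ability to hand a job to it. Under the NQS model, by contrast, each client would carry its own copy of the queue configuration.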
NQS lacks certain features that are found in DQS, such as a real-time limit on jobs, which could be useful for us. (NQS is more of a general-purpose queuing system, which can be used for print service as well as batch, while DQS is designed specifically as a batch system.) However, I feel that the NQS source code might be easier to work with, and NQS seems as if it might be a somewhat more stable and better supported product, with a better command line interface. (DQS also offers a GUI, but I have not built it.)

On the other hand, a commercial version of DQS, called CODINE, from Genias Software in Germany, is also available; this is not only a "cleaned up" version of the public domain code, but also will eventually offer true support for a Kerberos/AFS environment. From talking to the US distributor, it appears that their implementation will be based on the forwarding of tickets featured in Kerberos Version 5.

No matter which batch system we choose, we will need to implement several pieces ourselves:

- We will need to add job start/stop hooks to "prep" the execution system for the job, including adding/removing the user in the passwd file (the password itself should not be needed), cleaning up ticket and other temp files, possibly attaching filesystems, etc.

- We will need to tweak the submit authorization to accept a request from a user not in the server passwd file, and to accept a request from a host not configured into the batch system database. The user test could involve a check against a list of currently authorized longjob users. The host test will probably just be a check of the submitting principal's realm.

- We will presumably want to minimize the software required on a client system, as much as possible. This may mean putting queue configuration information in a centralized database such as Hesiod, similar to what was done with printcap entries.
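The user and host tests mentioned in the second item above could look roughly like the following. This is only a sketch under stated assumptions: the realm name is a placeholder, the function name is hypothetical, principals are assumed to be written in the Kerberos 5 form name/instance@REALM, and rejecting any principal that carries an instance is a conservative choice made here, not something the evaluation has settled.

```python
# Placeholder realm; the real value would be our local realm.
TRUSTED_REALM = "EXAMPLE.ORG"


def authorize_submit(principal, longjob_users, trusted_realm=TRUSTED_REALM):
    """Decide whether a submit request should be accepted.

    principal      -- authenticated principal, e.g. "jdoe@EXAMPLE.ORG"
    longjob_users  -- set of user names currently authorized for long jobs
    """
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or not realm:
        return False            # malformed principal
    if "/" in name:
        return False            # assumption: no instances (e.g. jdoe/root)
    # "Host" test: only the realm of the submitting principal is checked.
    if realm != trusted_realm:
        return False
    # User test: membership in the authorized list, not the passwd file.
    return name in longjob_users
```

Note that neither test consults the server's passwd file or the batch system's host database, which is the behavior change the item calls for.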
If we decide to take one of the public domain systems and modify it for our purposes, it will be a major task to Kerberize it. The most feasible approach is probably to implement a Kerberos 5-based system, which could use features such as forwarded or proxy tickets, and postdate-able and renewable tickets.

The forwarded or proxy ticket would be used to pass the user credentials from the submitting host along to the execution server; if a master queuing host is used, an intermediate hop is required. A proxy ticket could be used if the only credentials that need to be passed are those for the job's working (AFS) directory; to allow the job to assume the complete identity of the user, a forwarded ticket would be used. A renewable ticket would be used to permit the job to run beyond the length of a normal session; a postdate-able ticket would be used for a job submitted to run at a later time.

Question: Is it feasible to implement a system based on these Kerberos 5 features?
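As a way of framing that question, the mapping from job properties to the Kerberos 5 ticket features discussed above can be written down as a small decision function. This is a sketch only: it makes no Kerberos calls, the class and field names are hypothetical, and the ten-hour "normal session" lifetime is an assumed default rather than a measured local value.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Hypothetical summary of what a submitted job needs."""
    needs_full_identity: bool   # False: only the working AFS directory
    expected_runtime_s: int     # estimated run time in seconds
    start_delay_s: int = 0      # > 0 if the job is to start later


def ticket_features(job, normal_lifetime_s=10 * 3600):
    """Return the set of Kerberos 5 ticket features the job would need."""
    # Full user identity calls for a forwarded ticket; access limited
    # to the working directory could be covered by a proxy ticket.
    features = {"forwarded" if job.needs_full_identity else "proxy"}
    # Jobs that outlive a normal session need renewable tickets.
    if job.expected_runtime_s > normal_lifetime_s:
        features.add("renewable")
    # Jobs submitted to run later need a postdated ticket.
    if job.start_delay_s > 0:
        features.add("postdated")
    return features
```

For example, a two-day job that needs only its AFS working directory and is submitted to start ten minutes from now would require a proxy ticket that is both renewable and postdated.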