To: nschmidt@MIT.EDU, vkumar@MIT.EDU
Cc: wdc@MIT.EDU, mbarker@MIT.EDU
Subject: An Idea On Long Jobs
Date: Mon, 27 Jan 1997 15:45:09 EST
From: Mike Barker

This is a preliminary sketch of a possible approach to dealing with this long outstanding request.

Background

From time to time, someone raises the question of some sort of "long job" or "batch" processing in the Athena environment. Typically, a small group goes off and studies various approaches to dealing with this. The result usually is some kind of finding that providing a "multiprocessing" system with adequate protection for the various users is hard.

My Proposal

Suppose we start with the idea that the "long job" user gets a complete cpu to themselves. I.e., they would log in remotely to the system and have complete use of that system for the time they have "signed up" to use it. They would be the only person on that system.

This has the advantage that there is no need to protect the users from each other--they are physically separated. And if a user "crashes" their system, they are the only one affected.

As for tickets--if the user is using AFS or other facilities needing tickets, they will have to log in at least once within 22 hours and renew them. I.e., they should use the available facilities.

The critical change here is that the technical issues (of security with multiple users and so on) are largely removed by this approach.

Of course, this raises the question of how we would do signups and administration of the systems. Unless we know definitely that there will be demand for large numbers of such systems, I suggest we start with a simple signup (a web page, for example), and put everyone in the queue. As machines become available (from the pool of machines in the experiment), the "administrator" can simply assign the next person to the next machine.

To control access to the machines, they should be set to only allow logins from people in the local password files.
The administrator can then either log in remotely (as root) and change the entries in the passwd files, or we could develop a small "remote" job to change the password files.

How should we allocate time in this system? I'd recommend that users sign up for whatever time they think they need. If they finish early, they can notify the administrator. If they need more time, they can notify the administrator. If desired, we could do some kind of fees and overtime charges. I'll leave that refinement to whoever wants to pursue implementing this from the operational side.

So, in summary, I am suggesting that we sidestep the issues of how to deal with multiple batch users on a single system. Instead, simply provide a rack of systems for users to use on a "one user, one cpu" basis. The signup and administration of these systems can be as simple as a single list and manual assignment. I think this simple approach would provide some relief for those users who have an occasional "long job" taking several hours of processing.

Mike
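
As a rough illustration of the "single list and manual assignment" idea above, here is a minimal Python sketch of a first-come, first-served signup queue. It is only one way the bookkeeping might look, not a worked-out design: the class and method names (LongJobPool, sign_up, release), the machine names, and the usernames are all hypothetical, and the sketch does not touch password files or tickets.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class Signup:
    user: str    # hypothetical username
    hours: int   # time the user thinks they need


@dataclass
class LongJobPool:
    """One user per machine; everyone else waits in a single queue."""
    free_machines: Deque[str] = field(default_factory=deque)
    waiting: Deque[Signup] = field(default_factory=deque)
    assigned: Dict[str, Signup] = field(default_factory=dict)  # machine -> signup

    def sign_up(self, user: str, hours: int) -> List[Tuple[str, str]]:
        """Add a user to the end of the queue, then assign if a machine is free."""
        self.waiting.append(Signup(user, hours))
        return self._assign_next()

    def release(self, machine: str) -> List[Tuple[str, str]]:
        """The user on `machine` has finished; return it to the pool."""
        del self.assigned[machine]
        self.free_machines.append(machine)
        return self._assign_next()

    def _assign_next(self) -> List[Tuple[str, str]]:
        """Give the next free machine to the next person in the queue."""
        made = []
        while self.free_machines and self.waiting:
            machine = self.free_machines.popleft()
            signup = self.waiting.popleft()
            self.assigned[machine] = signup
            made.append((machine, signup.user))
        return made


if __name__ == "__main__":
    # Assumed example: a two-machine experiment and three signups.
    pool = LongJobPool(free_machines=deque(["longjob-1", "longjob-2"]))
    print(pool.sign_up("alice", 8))
    print(pool.sign_up("bob", 4))
    print(pool.sign_up("carol", 12))   # no machine free yet, so carol waits
    print(pool.release("longjob-1"))   # alice finishes; carol gets her machine
```

Each returned (machine, user) pair is the point at which the administrator would update the local password file on that machine so that only the assigned user can log in.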