To: vkumar@MIT.EDU, nschmidt@MIT.EDU
Cc: mbarker@MIT.EDU, wdc@MIT.EDU
Subject: Re: More Ideas On Long Jobs
Date: Thu, 30 Jan 1997 12:06:54 EST
From: Mike Barker

These are some further ideas, in part from hallway conversations.

Possible Extensions:

1. Where can we get so many machines?

   We're talking about taking some 120 systems out of service this
   year in the Athena Renewal. Suppose we took those systems, racked
   them down in W91, and tried this as an experiment for one year. If
   the systems "fall over", we simply pull them (i.e., don't waste
   much if any time/resources on maintenance). If the service is a
   raging success, maybe we put next year's "Athena Renewal Surplus"
   systems in more racks and keep growing it. If it stinks, we haven't
   spent much on it.

2. What about a large class where the faculty tells everyone to run a
   four hour job?

   If there are such classes, we can simply do a signup/lottery for
   slots (give them ...oh, say six hour slots). We should alert
   faculty that they need to do some preparation for this kind of
   assignment so that the resources will be available.

3. The more I think about it, the more I like having a small remote
   job that "rewrites" the password files. That way, when someone says
   they are done with a machine, the "rewriter" can be cued to

      1. copy in the standard base password files
      2. add in the next user's password entries

   This same process could be used with the classroom scheduling (with
   an automated "cueing" system).

4. What if someone wants to use several machines?

   If we have a large enough pool, let them.

5. What about someone who needs "large memory" or special disk
   resources?

   If there really is enough need for specialized machines, we can add
   them to the pool. Otherwise, I think we say that's outside of the
   scope of the service we are providing and go on.

6. Can't we reduce the manual management?

   If we have the "rewriter" I mentioned in 3, we could do something
   like:

      1. automated signup (puts you in queue - simple list)
      2. as a machine is released, the rewriter gets the next name
         from the queue and changes the password files. It sends
         mail/zephyr to the user saying machine x is ready for your
         use.
      3. when the user finishes with the machine, they run a "release"
         task, which tells the rewriter to get the next name and go.
      4. we could have this software (a fairly simple system) keep
         track of the time requested versus usage. At some point it
         could remind the user that they asked for y hours and they
         are now on y + 24 hours. At another point it could notify
         someone that user xxx seems to be overrunning.

   An interesting variant might be to have the rewriter automatically
   redo the password file after the requested time had passed (plus
   something?). This would help make sure that users didn't "sit" on
   the machines or forget to "release" the machines.

7. Bill had an interesting suggestion about a piece of software that
   automatically connected the user to a free machine if one was
   available. As I understood it, part of the idea was that the user
   wouldn't know the machine name--i.e. the "batch telnet" program
   would "mask" the machine from the user. We would need to expand
   that a bit so that it dealt gracefully with all machines in use and
   allowed the user to reconnect to "their" machine during the time
   they were using it, but the idea that you don't really know which
   machine you are using may be useful.

I guess part of what I'm doing is trying to nibble off some small
pieces of the "long jobs" work and suggest ways that we can tackle
those pieces.

mike
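
A minimal sketch of the signup queue and "rewriter" flow from items 3 and 6, in Python, may help make the moving parts concrete. It is only an illustration under stated assumptions, not an existing tool: the names (Rewriter, Request, signup, release, check_overruns), the in-memory lists standing in for the password files, the outbox list standing in for mail/zephyr, and time expressed as plain hours are all hypothetical. The 24-hour grace period mirrors the "y + 24 hours" example above.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical stand-in for the standard base password file.
BASE_PASSWD = ["root:*:0:0:root:/:/bin/sh"]


@dataclass
class Request:
    user: str
    passwd_entry: str       # the user's password-file line (illustrative)
    hours_requested: float


class Rewriter:
    """Simulates the rewriter: one FIFO signup list and a pool of machines."""

    def __init__(self, machines):
        self.queue = deque()                               # automated signup list
        self.passwd = {m: list(BASE_PASSWD) for m in machines}
        self.owner = {m: None for m in machines}           # machine -> Request or None
        self.started = {}                                  # machine -> start time in hours
        self.outbox = []                                   # stand-in for mail/zephyr

    def signup(self, request, now):
        """Put the user in the queue; hand out a machine if one is free."""
        self.queue.append(request)
        self._assign_free_machines(now)

    def release(self, machine, now):
        """The user's "release" task: reset the machine and pass it on."""
        self.passwd[machine] = list(BASE_PASSWD)           # copy in the base files
        self.owner[machine] = None
        self.started.pop(machine, None)
        self._assign_free_machines(now)

    def _assign_free_machines(self, now):
        for machine in list(self.owner):
            if self.owner[machine] is None and self.queue:
                req = self.queue.popleft()                 # next name from the queue
                # Base files plus the next user's password entries.
                self.passwd[machine] = list(BASE_PASSWD) + [req.passwd_entry]
                self.owner[machine] = req
                self.started[machine] = now
                self.outbox.append(
                    (req.user, f"machine {machine} is ready for your use")
                )

    def check_overruns(self, now, grace_hours=24.0):
        """Remind users who have passed requested time plus the grace period."""
        overrunning = []
        for machine, req in self.owner.items():
            if req is None:
                continue
            used = now - self.started[machine]
            if used >= req.hours_requested + grace_hours:
                self.outbox.append(
                    (req.user, f"you asked for {req.hours_requested} hours "
                               f"and have used {used}")
                )
                overrunning.append(req.user)
        return overrunning
```

In this sketch a real deployment would replace the in-memory password lists with whatever actually rewrites the files on the remote machine, and the variant of forcing a release after the requested time could be built on check_overruns by calling release for each overrunning machine.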