% $Source: /afs/sipb/user/warlord/pm/RCS/postmortem.txt,v $
% $Author: mhbraun $

		Last weekend SIPB AFS died.  This is the sequence of
events leading up to and during the outage.  Rosebud's root partition
was logging errors, so we needed to replace the root drive.  Service
was called.  Later, bell-atlantic needed to take the machine down.  The
WRONG thing got done (changing the CellServDBs), possibly causing
inconsistencies.  The question was how to keep the cell up with one
of the primaries down.
		The next morning gsstark was here, yandros came in,
bell-atl showed up, the machine was taken down, marc removed rosebud
from ronald-ann's CellServDB, and warlord sent a filsrv.  Before
bell-atl left they rebooted rosebud, and it came up, which could have
caused an inconsistency in the cell.
		We don't know the state of the CellServDBs at this point.
mhbraun didn't check.  Bell-atl called at 9am Friday to verify someone was
here.  gsstark was here, and they came.  When they arrived, gsstark didn't
have the combo for the alarm, so ylsul let them in.  mhbraun did a bos
shutdown and halted the machine remotely (11:17 according to the noc).
		Two things happened: 1) clients started to wedge, and 2)
gsstark realized he needed the service switch key (the medeco), but he had
locked the office and did not have a key.  He got CAC to let him into the
office, and then jis called.  Greg hadn't found the key, and mhbraun
told him where to look.  Jeff found mhbraun in the zone; mhbraun said he was
heading over.  Once it was apparent that the timeout was longer than it was
supposed to be, mhbraun asked jis for suggestions, and was told to make
sure bell-atl could work on the machine.
	Mhbraun arrived, and bell-atl was working.  gsstark had found the key.
Mhbraun rebooted ronald-ann to try to unwedge the clients that were wedged
(11:52).  That failed.  He then removed rosebud from the CellServDB and
rebooted again (11:53) and the clients unwedged.  When bell-atl finished,
tlyu, mhbraun, and bell-atl tried to boot the machine, but failed.  They
tried to install AIX 3.1, but eventually it failed (with both the 3.1 and
3.2 install media).
They didn't know if the AIX 3.2 server binaries were stable (only other server
is tardis), so they decided to wait until they could find probe before trying
to install AIX 3.2.  They then decided they didn't know what they were doing
and powered rosebud down.  Mail was sent to cfyi, bug-sipb, bug-outland,
sipb-afsreq, sipb-staff and several individuals outlining what was not
available and what state the machine was in.
	Later, tlyu talked to probe, who told him to install
AIX 3.2 and get AFS binaries from afsdev.  He installed and mkserv
afs'd, and then accidentally "newfs"'ed the /vicepx partitions by
creating the volume groups instead of importing them, losing all the
data.
		jis came in and helped with the cell, and said the
volume groups were blown away, so we would have to restore from
backups.
		Then, marc and ckclark brought up the AFS servers and
started to restore.  Ckclark had problems restoring some volumes to
rosebud.  Yandros restored some volumes to ronald-ann.  Next, all the
AFS server processes were shut down.
		warlord came in on Saturday night and wanted to
restore from tape.  The restore was having problems due to different
volumes getting the same volumeID from the vlserver.  Eventually all
the volumes got to disk, and a vldb restore was attempted.  The old
vldb was moved and a vos syncvldb was attempted, but the new vldb wasn't
created on rosebud!  At this point, it was discovered that the CellServDB on
ronald-ann did NOT have rosebud in it!  Warlord fixed this, restarted
the vlservers, and rebuilt the vldb again, successfully this time.  He
verified that all the volumes on rosebud were online.  The other
servers were restarted on both machines.  He then left to get some
sleep, knowing the cell was back up and consistent, volumewise.  
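		The CellServDB mismatch above is the kind of thing a
quick comparison would catch before restarting the vlservers.  What
follows is only a sketch, not something that was run during the
outage.  It assumes the usual CellServDB layout (a ">cellname" header
line followed by "address  #hostname" lines).  The cell name
"example.org" and the 192.0.2.x addresses are made up for
illustration, and how the copies get fetched from each server is left
out.

```python
def parse_cellservdb(text):
    """Return {cell_name: set of server hostnames} from CellServDB text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header: ">cellname  #optional comment"
            current = line[1:].split("#", 1)[0].strip()
            cells.setdefault(current, set())
        elif current is not None:
            # Server line: "address  #hostname"; fall back to the
            # address when no hostname comment is present.
            addr, _, host = line.partition("#")
            cells[current].add(host.strip() or addr.strip())
    return cells


def missing_servers(copies, cell):
    """For each machine's copy, list servers that appear in some other
    copy of the given cell's entry but not in this one."""
    parsed = {m: parse_cellservdb(t).get(cell, set()) for m, t in copies.items()}
    everyone = set().union(*parsed.values())
    return {m: sorted(everyone - servers) for m, servers in parsed.items()}


# Hypothetical contents; the real files live on each server.
ROSEBUD_COPY = """>example.org   #hypothetical cell name
192.0.2.1    #rosebud
192.0.2.2    #ronald-ann
"""
RONALD_ANN_COPY = """>example.org   #hypothetical cell name
192.0.2.2    #ronald-ann
"""

if __name__ == "__main__":
    report = missing_servers(
        {"rosebud": ROSEBUD_COPY, "ronald-ann": RONALD_ANN_COPY},
        "example.org",
    )
    for machine, missing in sorted(report.items()):
        print(machine, "is missing:", ", ".join(missing) or "nothing")
```

		With the made-up copies above, the report flags
ronald-ann's copy as missing rosebud, which is the state warlord
found on Saturday night.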
		The ptdb and kadb probably need to be recreated.


Errors: 

1) The people who worked on rosebud need to communicate more.


----- more stuff from minutes to incorporate into this doc.

		project.sipb had its quota set to 30M (ckclark is
wrong).  project.outland lost some state.  The pts database needs to
be recreated, and the kadb is toast.  Maybe we should make a periodic
backup of the ptdb?

		msg from jis: Jeff expressed general unhappiness at
this situation.  The SIPB Cell is a critical service, since many
clients can lose if it's managed incorrectly.  The competency of the
maintainers is now in question, which has possible effects on future
relations.  In the end, you should be SURE you know what you're
doing, and not touch things you're not sure about.

		People don't totally agree with this, since it was a
freak accident, the observed outage was unexpected, and real service
outage was kept to a minimum.  Another thing is that more people
should learn how to do things.

		What we did was acceptable.  We made some mistakes but
the magnitude is being blown out of proportion.

		Take the RS/6000 and replace it with a Maxine?  jis
suggested this.  It's a good idea, but it's going to be difficult to get
the hardware.  There is an administrative detail of moving the data,
but people think it's a good solution if we can do it.