% $Source: /afs/sipb/user/warlord/pm/RCS/postmortem.txt,v $
% $Author: mhbraun $

Last weekend SIPB AFS died. This is the sequence of events leading up to and during the outage.

Rosebud's root partition was logging errors. We needed to replace the root drive. Service was called. Later, bell-atlantic needed to take the machine down. The WRONG thing got done (changing the CellServDBs), possibly causing inconsistencies. The question was how to keep the cell up with one of the primaries down.

The next morning gsstark was here, yandros came in, bell-atl showed up, the machine was taken down, marc removed rosebud from ronald-ann's CellServDB, and warlord sent a filsrv. Before bell-atl left they rebooted rosebud, and it came up, which could have caused an inconsistency in the cell. We don't know the state of the CellServDBs at this point; mhbraun didn't check.

Bell-atl called at 9am Friday to verify someone was here. gsstark was here, and they came. When they arrived, gsstark didn't have the combo for the alarm, so ylsul let them in. mhbraun did a bos shutdown and halted the machine remotely (11:17 according to the noc). Two things happened: 1) clients started to wedge, and 2) gsstark realized he needed the service switch key (the medeco), but he had locked the office and did not have a key. He got CAC to let him into the office. Once he was in the office, jis called. Greg hadn't found the key, and mhbraun told him where to look. Jeff found mhbraun in the zone; mhbraun said he was heading over. After it was apparent that the timeout was longer than it was supposed to be, mhbraun asked jis for suggestions, and was told to make sure bell-atl could work on the machine.

Mhbraun arrived, and bell-atl was working. gsstark had found the key. Mhbraun rebooted ronald-ann to try to unwedge the clients that were wedged (11:52). That failed. He then removed rosebud from the CellServDB and rebooted again (11:53), and the clients unwedged.

Bell-atl finished; tlyu, mhbraun, and bell-atl then tried to boot the machine, but failed. They tried to install AIX 3.1, but eventually it failed (both 3.1 and 3.2 install media). They didn't know if the AIX 3.2 server binaries were stable (the only other server is tardis), so they decided to wait until they could find probe before trying to install AIX 3.2. They then decided they didn't know what they were doing and powered rosebud down. Mail was sent to cfyi, bug-sipb, bug-outland, sipb-afsreq, sipb-staff and several individuals, outlining what was not available and what state the machine was in.

Later, tlyu talked to probe, who told him to install AIX 3.2 and get AFS binaries from afsdev. He installed it, ran mkserv afs, and then accidentally "newfs"'ed the /vicepx partitions by creating the volume groups instead of importing them, losing all the data. jis came in and helped with the cell, and said the volume groups were blown away, so we would have to restore from backups.

Then marc and ckclark brought up the AFS servers and started to restore. Ckclark had problems restoring some volumes to rosebud. Yandros restored some volumes to ronald-ann. Next, all the AFS server processes were shut down.

warlord came in on Saturday night and wanted to restore from tape. The restore was having problems because different volumes were getting the same volumeID from the vlserver. Eventually all the volumes got to disk, and a vldb restore was attempted. The old vldb was moved and a vos syncvldb was attempted, but a new vldb wasn't created on rosebud! At this point, it was discovered that the CellServDB on ronald-ann did NOT have rosebud in it! Warlord fixed this, restarted the vlservers, and rebuilt the vldb again, successfully this time. He verified that all the volumes on rosebud were online. The other servers were restarted on both machines. He then left to get some sleep, knowing the cell was back up and consistent, volumewise. The ptdb and kadb probably need to be recreated.
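
The recurring problem above was that nobody compared the CellServDB copies on the two servers. A minimal sketch of such a comparison follows, assuming the usual CellServDB layout (a ">cellname" line followed by one "address #hostname" line per database server). The function names, cell name, and addresses are made up for illustration; this is not a tool that was used during the recovery.

```python
def parse_cellservdb(text):
    """Return {cell name: set of server host names} from CellServDB text."""
    cells = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cell header: ">cellname   #optional comment"
            current = line[1:].split("#", 1)[0].strip()
            cells.setdefault(current, set())
        elif current is not None:
            # Server entry: "address   #hostname"; fall back to the
            # address when no host name comment is present.
            addr, _, host = line.partition("#")
            cells[current].add(host.strip() or addr.strip())
    return cells


def missing_servers(cell, databases):
    """Map each machine to the servers its CellServDB copy lacks,
    relative to the union of servers listed for `cell` on all machines.
    Machines whose copy is complete are omitted from the result."""
    listed = {m: parse_cellservdb(t).get(cell, set())
              for m, t in databases.items()}
    everyone = set().union(*listed.values())
    return {m: everyone - s for m, s in listed.items() if everyone - s}


if __name__ == "__main__":
    # Hypothetical file contents; the cell name and addresses are placeholders.
    on_rosebud = ">cell.example #Example\n192.0.2.1 #rosebud\n192.0.2.2 #ronald-ann\n"
    on_ronald_ann = ">cell.example #Example\n192.0.2.2 #ronald-ann\n"
    print(missing_servers("cell.example",
                          {"rosebud": on_rosebud, "ronald-ann": on_ronald_ann}))
    # prints {'ronald-ann': {'rosebud'}}
```

An empty result means every copy lists the same servers for the cell; anything else names the machine whose copy is short and which servers it is missing.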

Errors:

1) The people who worked on rosebud need to communicate more.

-----
more stuff from minutes to incorporate into this doc.

project.sipb had its quota set to 30M (ckclark is wrong). project.outland lost some state. The pts database needs to be recreated, and the kadb is toast. Maybe we should make a periodic backup of the ptdb?

msg from jis:

Jeff expressed general unhappiness at this situation. The SIPB Cell is a critical service, since many clients can lose if it's managed incorrectly. The competency of the maintainers is now in question, which has possible effects on future relations. In the end, people should be SURE they know what they're doing, and shouldn't touch things they're not sure about.

People don't totally agree with this, since it was a freak accident, the observed outage was unexpected, and real service outage was kept to a minimum. Another thing is that more people should learn how to do things. What we did was acceptable. We made some mistakes, but the magnitude is being blown out of proportion.

Take the RS/6000 and replace it with a Maxine? jis suggested this. It's a good idea, but it's going to be difficult to get the hardware. There is an administrative detail of moving the data, but people think it's a good solution if we can do it.