SIPB AFS Backups


Follow these steps carefully:

  1. Before starting the backup, there are a few things that should be checked; undetected problems here could cause the backup to fail (or be incomplete).

    1. See if all the server processes are running:
      	bos status ronald-ann -long -noauth
      	bos status rosebud -long -noauth
      
    2. See if there's enough free space:
      	fs df /afs/sipb/service/partitions/*
      
      If any partition is close to full, move volumes off it before continuing.
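      The "close to full" check can be scripted. A sketch, assuming the fs df output resembles df(1) (a header line, then one row per partition with the use percentage in the fifth column) and using an arbitrarily chosen 90% cutoff:

```shell
#!/bin/sh
# check_full: read fs-df-style output on stdin and warn about any
# partition whose use percentage is at or above the given threshold.
# (The column layout and the 90% cutoff are assumptions.)
check_full() {
    awk -v t="$1" '
        NR > 1 {                        # skip the header line
            pct = $5; sub(/%/, "", pct) # strip the trailing "%"
            if (pct + 0 >= t)
                printf "WARNING: %s is %s%% full\n", $1, pct
        }'
}

# Assumed invocation:
#   fs df /afs/sipb/service/partitions/* | check_full 90
```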
    3. See if any volumes are offline or busy:
         vos listvol ronald-ann -noauth
         vos listvol rosebud -noauth
      
      Ideally, every line that starts with "Total volumes" should end with "Total volumes offLine 0 ; Total busy 0". Otherwise, find out if someone else is doing volume operations, or if some volumes need to be salvaged (see part D below).

      Since the listvol output also gives disk-space usage, it should be scanned for numbers that look way out of line, e.g., 100 Mb user volumes.
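      Checking the summary lines by hand is error-prone; the scan could be automated along these lines (the summary-line wording is taken from the text above):

```shell
#!/bin/sh
# summary_check: read "vos listvol ... -noauth" output on stdin and
# flag any nonzero offLine or busy counts in the per-partition
# "Total volumes ..." summary lines.
summary_check() {
    awk '
        /^Total volumes/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "offLine" && $(i+1) + 0 > 0)
                    print "offline volumes:", $(i+1)
                if ($i == "busy" && $(i+1) + 0 > 0)
                    print "busy volumes:", $(i+1)
            }
        }'
}

# Assumed invocation:
#   vos listvol ronald-ann -noauth | summary_check
```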

    4. Read all the log files from both servers, i.e., ronald-ann:/usr/afs/logs/*Log and rosebud:/usr/afs/logs/*Log.
      There will be a lot of messages that are actually fairly normal occurrences, and can usually be ignored. These include
         Break call back failed for host
         CB: Call back connect back failed (in break delayed)
         CB: RCallBack (zero fid probe in host.c) failed for host
         CB: RCallBackConnectBack (host.c) failed for host
         Discarded a packet for ########
         VAttachVolume: Cannot read volume header
         fssync: callbacks broken for volume #########
         fssync: volume ######### moved to ########; breaking all call backs
         trans ######## is older than 300 seconds
      
      The most common messages that are of interest are
         Partition /vicepX that contains volume ######### is full
         Volume ######### needs to be salvaged
      
      Both of these indicate a potential need to do bos salvage immediately.
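      A filter for the two urgent messages might look like this (the patterns are copied from the message forms listed above; everything else still deserves a skim by eye):

```shell
#!/bin/sh
# urgent_msgs: read an AFS server log on stdin and print only the two
# messages that indicate an immediate need for a bos salvage or for
# volume moves.
urgent_msgs() {
    egrep 'that contains volume .* is full|needs to be salvaged'
}

# Assumed invocation, for each log on each server:
#   urgent_msgs < /usr/afs/logs/FileLog
```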
    5. Make sure the backup software and database still exist, e.g.,
         ls -l /afs/sipb/project/newdump /usr/afs/backup
      
      (/usr/afs/backup is a local directory on the /var partition of hodge)
  2. The next step is to find the correct three tapes for the current week's backup. There are four groups of tapes, i.e., tapes are reused every four weeks. Each group of tapes includes one for each of the following three purposes:
    1. ronald-ann, all partitions
    2. rosebud, /vicepa and /vicepb
    3. rosebud, /vicepc
    Each tape has a label (i.e., written to the tape media -- not the one written on the tape package) that identifies the volume set and dump level that it is used for. The names of the volume sets are rann, rbud, and rsqr. These correspond, respectively, to A, B, and C above. That is, a volume set is the collection of all volumes having in common one of these server or server/partition locations.

    The four groups of tapes correspond to four dump levels. The dump-level names are of the form /volume-set-name_# (e.g., /rann_1, /rann_2, /rann_3, /rann_4, /rbud_1, etc.). There is no particular reason for having separate dump-level names for each volume set, i.e., /foo_1, /foo_2, etc. could also have been used. In other words, the dump-level name is just an arbitrary string to distinguish the different tapes used for the same volume set. Regardless of the dump level, the full contents of all volumes in the volume set are written to tape. All levels are equivalent; none is higher or lower than any other.

    The tape label is formed by putting together the volume-set name, dump-level name, and a sequence number. For example, the tape used for volume set rbud and dump level /rbud_3 has the label rbud.rbud_3.1. The ".1" means that it is the first tape of the sequence. For sipb-cell backups, there is always just one tape in a sequence, so the label names will always end in ".1".
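    The naming convention can be expressed as a one-line helper; a sketch (the function name is made up for illustration):

```shell
#!/bin/sh
# tape_label: build a tape label from a volume-set name and a
# dump-level number, following the volume-set.dump-level.sequence
# convention described above (the sequence number is always 1 here).
tape_label() {
    echo "${1}.${1}_${2}.1"
}

# For example, "tape_label rbud 3" prints rbud.rbud_3.1.
```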

    (Incidentally, there is no particular reason for having cryptic four-letter volume-set names. Originally, there was only a single volume set for the sipb cell, called sipb. Once the cell had too much data to fit on one tape, two new volume sets replaced it, with names rann and rbud, for compatibility with the original four-letter name. When the R-Squared external drive was connected to rosebud, a third volume set was added, called rsqr.)

    The information written on the tape package will indicate what tape is inside. For example, the rbud.rbud_3.1 tape will say "ROSEBUD 3", the rsqr.rsqr_4.1 tape will say "RSQR 4", etc.

    The dump levels are used sequentially. That is, if one week the levels /rann_2, /rbud_2, and /rsqr_2 are used, then the next week /rann_3, /rbud_3, and /rsqr_3 would be used. If the tapes for the current dump levels are not in the SIPB office, they are with the IS Media Storage service in Building 11. They can be retrieved by bringing our tape receipt (storage number CTG007) to 11-226, Monday-Friday from 8 AM to 4 PM. It takes about 5 minutes for the people there to get the tapes. Remember that the tapes should be obtained by Friday if Monday is a holiday. When retrieving tapes, bring a new set to swap into storage: the second-to-most-recent ones. The most recent tapes should be kept in the office, in case they are needed to restore data.
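    Next week's level number follows mechanically from this week's; a sketch of the four-week rotation:

```shell
#!/bin/sh
# next_level: given this week's dump-level number (1-4), print next
# week's, wrapping 4 back around to 1 per the four-week rotation.
next_level() {
    echo $(( $1 % 4 + 1 ))
}

# For example, "next_level 2" prints 3, and "next_level 4" prints 1.
```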

    If the tapes for the correct dump levels can't be obtained, it is OK to use different dump levels. Using the dump levels sequentially is just a convention (and it makes sense to overwrite older backups, not newer ones); none of the software requires it.

    If it's necessary to use a new tape, or one that was previously used for something else, then a labeltape operation should be done prior to the backup. (Note: usually, the only tapes that should be reused are AFS tapes from previous years. The other tapes in the SIPB office generally have some archival value.)

    To do labeling, start up two separate root shells on hodge. Each of them should be in a PAG with suitable root-instance tokens, and have the cwd /afs/sipb/project/newdump. In the first one, type

       # @sys/backup -cell sipb.mit.edu
    
    and in the second, type
       # @sys/butc
    
    In the first shell there will be a "backup> " prompt. To label a tape as rbud.rbud_3.1, type
       backup> labeltape rbud.rbud_3.1
    
    There will be a prompt to hit return in the second shell. The second shell will eventually indicate that the labeling finished (it takes a few minutes). At this point, type quit in the first shell, control-C in the second, and (presumably) unlog and kdestroy in one or both of them.
  3. Before starting the backup, obtain a single root shell on hodge. Get tokens for someone listed in system:administrators and in /usr/afs/etc/UserList on ronald-ann and rosebud. These should be in a separate PAG from any other shells on hodge, so that an accidental unlog elsewhere doesn't disturb the backup. One way to do this is to run the newpag program, then setenv KRBTKFILE, do a root-instance kinit, aklog sipb, and kdestroy. Root-instance tickets aren't needed for the backup itself: tokens for the sipb cell alone are adequate. It's best to set umask to 022 or 002 so that the log files on hodge's local disk will be readable.
  4. The backup process itself has two steps: volume cloning and volume dumping. Volume dumping is done separately for each volume set, i.e., rann, rbud, and rsqr. They can be done in any order.

    Volume cloning is done once per server. Volume cloning must be done before volume dumping. It takes much less time (about one-tenth as much). It is probably best to do it immediately before. Volume cloning accesses all of the volumes that will be dumped. If there is a problem with accessing any volume, it will often be detected during cloning, and can be corrected (e.g., with a bos salvage) before dumping. If cloning is done too long before dumping (e.g., a day earlier) there's a possibility that some problem will develop in between the two times. This can cause the backup to be incomplete, since that volume will not be dumped.

    The volume cloning is done via sh scripts. The general idea is to clone all volumes that have names not ending with ".nb" and not beginning with "disk.". First, use mkvbscript to create a script containing the series of vos backup commands:

       cd /afs/sipb/project/newdump
       scripts/mkvbscript servername
    
    Then, run this script, recording its output to a file:
       script vb-script.server_abbreviation.YYMMDD
       /usr/tmp/vb-script.servername.DDMonYYYY
       ^D
    
    Compress the output file, and save it in project.newdump:
       compress vb-script.server_abbreviation.YYMMDD
       mv vb-script.server_abbreviation.YYMMDD.Z last/
    
    (The server abbreviations are rann and rbud.)
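    The selection rule that mkvbscript implements (skip volumes whose names end in ".nb" or begin with "disk.") could be sketched as follows. This is an illustration only: the real scripts/mkvbscript may work differently, and the assumed vos listvol line format (volume name, then numeric volume ID) should be checked against real output:

```shell
#!/bin/sh
# mkvb_sketch: read "vos listvol" output on stdin and emit one
# "vos backup" command per eligible volume, skipping names ending in
# ".nb" and names beginning with "disk." (the rule described above).
mkvb_sketch() {
    awk '
        # Per-volume lines are assumed to have the volume name in
        # column 1 and a numeric volume ID in column 2; summary and
        # blank lines do not match and are skipped.
        $2 ~ /^[0-9]+$/ && $1 !~ /\.nb$/ && $1 !~ /^disk\./ {
            print "vos backup " $1
        }'
}

# Assumed invocation:
#   vos listvol ronald-ann -noauth | mkvb_sketch
```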

    If any of the vos backup commands failed, do a bos salvage on that individual volume, then run that vos backup command again.

    The volume dumping is done with a perl script, which currently has some problems (mainly, it does not finish cleanly) but is still a bit more convenient than other methods. The main function of the script is to run the backup program, giving it the command "dump volume-set-name dump-level-name", e.g., "dump rann /rann_2". The script also starts up the butc program, which controls I/O to the tape drive, and handles interaction between backup and butc using perl's chat2 facility. Before running it, put the correct tape into the tape drive, and wait until the green light on the front is lit. Then, just give the dump arguments on the command line, along with the -noclone switch, e.g.,

       /afs/sipb/project/newdump/backup.pl rann /rann_2 -noclone
    
    For volume set rann, expect to wait about 4 hours. For rbud or rsqr, it should take less than 2 hours. There will be on the order of 10-50 lines of output reporting the status of the backup. At the end, there should be a final message of the form "Finished doing dump rann.rann\_2 successfully". At this point, hit control-C until the shell prompt comes back (it will take either 2 or 3 control-C's).

    Once the backup of each volume set finishes, find the processes (on hodge) running "@sys/butc" and "@sys/backup -cell sipb.mit.edu" and kill them manually. There will be files in /usr/tmp whose names are backup and butc followed by the date. If the backup failed, they should be deleted. If it succeeded, they should be compressed and stored away in project.newdump, e.g.,

       cd /afs/sipb/project/newdump/last
       compress -c /usr/tmp/backup.960108 > backup.rann.960108.Z
       compress -c /usr/tmp/butc.960108 > butc.rann.960108.Z
    
    The most common failure mode is:
       file_tm: code = -1, errno = 5

       file_tm: I/O error Writing data
       Error in receiving data
       butm: tape I/O error
       Dump volume-set-name.dump-level-name encountered an error
    
    In this case, the tape should be marked "bad", a new tape should be inserted, and backup.pl should be run again with the same arguments. In general, it's necessary to re-run backup.pl whenever a "Dump ... encountered an error" message is obtained.

    Write an entry on the tape package listing the date of the successful backup, the dump-level name, and username, e.g.,

       01-08-96 /rann_2 paco
    
  5. After the backup of all the volume sets finishes, the AFS copy of the backup database should be updated. This is for safety, in case the local disk on hodge dies:
       rm -rf /afs/sipb/project/newdump/hodge/*
       cd /usr/afs/backup; tar cf - . | (cd /afs/sipb/project/newdump/hodge; tar xpf -)
    
    This would usually be the time to unlog.

MIT SIPB