\chapter{Chores}

\section{SIPB AFS Backups}

\begin{itemize}
\item{Before starting} the backup, there are a few things that ought to be
checked, since any problems they reveal could cause the backup to fail
(or be incomplete) if not detected in advance.
\begin{itemize}
\item{See if all the server processes are running:}
\begin{verbatim}
   bos status ronald-ann -long -noauth
   bos status rosebud -long -noauth
\end{verbatim}

\item{See if there's enough free space:}

\begin{verbatim}
   fs df /afs/sipb/service/partitions/*
\end{verbatim}

If any partition is close to full, volumes should be moved before
continuing.
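The near-full check can be sketched as a small shell filter. This is an
assumption about the output layout (df-style, with the percent-used figure
in the last column); the 90\% threshold and the name check\_full are
illustrative, not part of the backup suite:

```shell
# Flag any partition at or above a given use threshold. Assumes df-style
# output where the last column is the percent-used figure (e.g. "95%");
# the default threshold of 90 is an arbitrary choice.
check_full() {
    awk -v t="${2:-90}" 'NR > 1 {
        pct = $NF
        sub(/%/, "", pct)                 # strip the trailing "%"
        if (pct + 0 >= t) print $1 " is " $NF " full"
    }' "$1"
}
```

Feeding it the saved output of the fs df command above would print one line
per partition that needs volumes moved off of it.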


\item{See if any volumes are offline or busy:}
\begin{verbatim}
   vos listvol ronald-ann -noauth
   vos listvol rosebud -noauth
\end{verbatim}
\par
Ideally, all the lines that start with "Total volumes" should have
"Total volumes offLine 0 ; Total busy 0". Otherwise, find out if
someone else is doing volume operations, or if some volumes need to be
salvaged (see the discussion of salvaging below).
\par
Since the listvol output also gives disk-space usage, it can be scanned
for numbers that look way out of line, e.g., 100 MB user volumes.
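This check can be mechanized with a short filter (a sketch; the field
layout is assumed from the summary line quoted above, and check\_listvol is
a hypothetical name):

```shell
# Print any "Total volumes" summary line whose offLine or busy count is
# nonzero. Layout assumed to be of the form
# "Total volumes onLine 57 ; Total volumes offLine 0 ; Total busy 0".
check_listvol() {
    awk '/^Total volumes/ {
        for (i = 1; i < NF; i++)
            if (($i == "offLine" || $i == "busy") && $(i + 1) + 0 != 0) {
                print
                next
            }
    }' "$1"
}
```

Run against the saved vos listvol output, it prints nothing when all is
well and the offending summary lines otherwise.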


\item{Read all the log files from both servers, i.e.,} ronald-ann:/usr/afs/logs/*Log and rosebud:/usr/afs/logs/*Log.
\par
There will be a lot of messages that are actually fairly normal
occurrences, and can usually be ignored. These include
\begin{verbatim}
   Break call back failed for host
   CB: Call back connect back failed (in break delayed)
   CB: RCallBack (zero fid probe in host.c) failed for host
   CB: RCallBackConnectBack (host.c) failed for host
   Discarded a packet for ########
   VAttachVolume: Cannot read volume header
   fssync: callbacks broken for volume #########
   fssync: volume ######### moved to ########; breaking all call backs
   trans ######## is older than 300 seconds
\end{verbatim}
The most common messages that are of interest are
\begin{verbatim}
   Partition /vicepX that contains volume ######### is full
   Volume ######### needs to be salvaged
\end{verbatim}
Both of these indicate a potential need to do a bos salvage immediately.
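The scan for these two messages can be sketched with grep (patterns taken
from the lines above; check\_logs is a hypothetical name, and the log files
would be the *Log files named earlier):

```shell
# Print log lines matching the two messages that call for an immediate
# bos salvage (patterns from the list above).
check_logs() {
    grep -hE 'that contains volume .* is full|needs to be salvaged' "$@"
}
```

The benign messages in the longer list above are deliberately not matched,
so any output from this check deserves attention.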


\item{Make sure the backup software and database still exist, e.g.,}
\begin{verbatim}
   ls -l /afs/sipb/project/newdump /usr/afs/backup
\end{verbatim}
(/usr/afs/backup is a local directory on the /var partition of hodge)

\end{itemize}


\item{The next step} is to find the correct three tapes for the current
week's backup. There are four groups of tapes, i.e., tapes are reused
every four weeks. Each group of tapes includes one for each of the
following three purposes:
\begin{itemize}
   \item ronald-ann, all partitions
   \item rosebud, /vicepa and /vicepb
   \item rosebud, /vicepc
\end{itemize}
\par
Each tape has a label (i.e., written to the tape media -- not the one
written on the tape package) that identifies the volume set and dump
level that it is used for. The names of the volume sets are rann,
rbud, and rsqr. These correspond, respectively, to the three items listed above.
That is, a volume set is the collection of all volumes having in
common one of these server or server/partition locations.
\par
The four groups of tapes correspond to four dump levels. The
dump-level names are of the form /volume-set-name\_\# (e.g., /rann\_1,
/rann\_2, /rann\_3, /rann\_4, /rbud\_1, etc.). There is no particular
reason for having separate dump-level names for each volume set, i.e.,
/foo\_1, /foo\_2, etc. could also have been used. In other words, the
dump-level name is just an arbitrary string to distinguish the
different tapes used for the same volume set. Regardless of the dump
level, the full contents of all volumes in the volume set are written
to tape. All levels are equivalent; none is higher or lower than any
other.
\par
The tape label is formed by putting together the volume-set name,
dump-level name, and a sequence number. For example, the tape used for
volume set rbud and dump level /rbud\_3 has the label rbud.rbud\_3.1.
The ".1" means that it is the first tape of the sequence. For
sipb-cell backups, there is always just one tape in a sequence, so the
label names will always end in ".1".
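The label convention can be captured in a one-line helper (a sketch;
tape\_label is a hypothetical name, not part of the backup suite):

```shell
# Build a tape label from volume-set name, dump-level name, and sequence
# number, per the convention above: the leading "/" of the dump-level
# name is dropped, and the sequence number defaults to 1.
tape_label() {
    printf '%s.%s.%s\n' "$1" "${2#/}" "${3:-1}"
}
```

For example, tape\_label rbud /rbud\_3 prints rbud.rbud\_3.1, the label
described above.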
\par
(Incidentally, there is no particular reason for having cryptic
four-letter volume-set names. Originally, there was only a single
volume set for the sipb cell, called sipb. Once the cell had too much
data to fit on one tape, two new volume sets replaced it, with names
rann and rbud, for compatibility with the original four-letter name.
When the R-Squared external drive was connected to rosebud, a third
volume set was added, called rsqr.)
\par
The information written on the tape package will indicate what tape is
inside. For example, the rbud.rbud\_3.1 tape will say "ROSEBUD 3", the
rsqr.rsqr\_4.1 tape will say "RSQR 4", etc.
\par
The dump levels are used sequentially. That is, if one week the levels
/rann\_2, /rbud\_2, and /rsqr\_2 are used, then the next week /rann\_3,
/rbud\_3, and /rsqr\_3 would be used. If the tapes for the current dump
levels are not in the SIPB office, they are with the IS Media Storage
service in Building 11. They can be retrieved by bringing our tape
receipt (storage number CTG007) to 11-226 Monday-Friday from 8 AM - 4
PM. It takes about 5 minutes for the people there to get the tapes.
Remember that the tapes should be obtained by Friday if Monday is a
holiday. When retrieving tapes, a new set of tapes should be brought
there, to swap into storage. The tapes to bring are the
second-to-most-recent ones. The most recent tapes should be kept in
the office, in case they are needed to restore data.
\par
If the tapes for the correct dump levels can't be obtained, it is OK
to use different dump levels. Using the dump levels sequentially is
just a convention (and it makes sense to overwrite older backups, not
newer ones); none of the software requires it.
\par
If it's necessary to use a new tape, or one that was previously used
for something else, then a labeltape operation should be done prior to
the backup. (Note: usually, the only tapes that should be reused are
AFS tapes from previous years. The other tapes in the SIPB office
generally have some archival value.)
\par
To do labeling, start up two separate root shells on hodge. Each of
them should be in a PAG with suitable root-instance tokens, and have
the cwd /afs/sipb/project/newdump. In the first one, type
\begin{verbatim}
   # @sys/backup -cell sipb.mit.edu
\end{verbatim}
and in the second, type
\begin{verbatim}
   # @sys/butc
\end{verbatim}
In the first shell there will be a "backup> " prompt. To label a tape
as rbud.rbud\_3.1, type
\begin{verbatim}
   backup> labeltape rbud.rbud_3.1
\end{verbatim}
There will be a prompt to hit return in the second shell. The second
shell will eventually indicate that the labeling finished (it takes a
few minutes). At this point, type quit in the first shell, control-C
in the second, and (presumably) unlog and kdestroy in one or both of
them.




\item{Before starting the backup}, obtain a single root shell on hodge.
Get tokens for someone listed in system:administrators and in
/usr/afs/etc/UserList on ronald-ann and rosebud. Presumably, these
should be in a separate PAG from any other shells on hodge, so that an
accidental unlog elsewhere doesn't disturb the backup. One way to do
this is to run the newpag program, then setenv KRBTKFILE, do a
root-instance kinit, aklog sipb, and kdestroy. Root-instance tickets
aren't needed for the backup: tokens for only the sipb cell will be
adequate. It's best to set umask to 022 or 002 so that the log files
on hodge's local disk will be readable.

\item{The backup process} itself has two steps: volume cloning and volume
dumping. Volume dumping is done separately for each volume set, i.e.,
rann, rbud, and rsqr. They can be done in any order.
\par
Volume cloning is done once per server. Volume cloning must be done
before volume dumping. It takes much less time (about one-tenth as
much). It is probably best to do it immediately before. Volume cloning
accesses all of the volumes that will be dumped. If there is a problem
with accessing any volume, it will often be detected during cloning,
and can be corrected (e.g., with a bos salvage) before dumping. If
cloning is done too long before dumping (e.g., a day earlier) there's
a possibility that some problem will develop in between the two times.
This can cause the backup to be incomplete, since that volume will not
be dumped.
\par
The volume cloning is done via sh scripts. The general idea is to
clone all volumes that have names not ending with ".nb" and not
beginning with "disk.". First, use mkvbscript to create a script
containing the series of vos backup commands:
\begin{verbatim}
   cd /afs/sipb/project/newdump
   scripts/mkvbscript servername
\end{verbatim}
Then, run this script, recording its output to a file:
\begin{verbatim}
   script vb-script.server_abbreviation.YYMMDD
   /usr/tmp/vb-script.servername.DDMonYYYY
   ^D
\end{verbatim}
Compress the output file, and save it in project.newdump:
\begin{verbatim}
   compress vb-script.server_abbreviation.YYMMDD
   mv vb-script.server_abbreviation.YYMMDD.Z last/
\end{verbatim}
(The server abbreviations are rann and rbud.)
\par
If any of the vos backup commands failed, do a bos salvage on that
individual volume, then run that vos backup command again.
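The selection rule above can be sketched as follows. This is an assumption
about mkvbscript's internals (all that is known here is the rule it
implements), and mk\_vb\_script is a hypothetical name:

```shell
# Emit one "vos backup" command per eligible volume name read from a file
# (one name per line), skipping names ending in ".nb" or beginning with
# "disk." -- the selection rule described above.
mk_vb_script() {
    awk 'NF > 0 && $1 !~ /\.nb$/ && $1 !~ /^disk\./ { print "vos backup " $1 }' "$1"
}
```

Fed a list such as user.paco, user.paco.nb, disk.scratch, project.sipb,
it emits vos backup lines for user.paco and project.sipb only.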
\par
The volume dumping is done with a perl script. The perl script
currently has some problems (mainly: it does not finish cleanly) but
it is still a bit more convenient than other methods. The main
function of the script is to run the backup program, giving it the
command "dump volume-set-name dump-level-name", e.g., "dump rann
/rann\_2". The script also starts up the butc program, which controls
I/O to the tape drive, and handles interaction between backup and butc
using perl's chat2 facility. Before running it, put the correct tape
into the tape drive, and wait until the green light on the front is
lit. Then, just give the dump arguments on the command line, along
with the -noclone switch, e.g.,
\begin{verbatim}
   /afs/sipb/project/newdump/backup.pl rann /rann_2 -noclone
\end{verbatim}
For volume set rann, expect to wait about 4 hours. For rbud or rsqr,
it should take less than 2 hours. There will be on the order of 10-50
lines of output reporting the status of the backup. At the end, there
should be a final message of the form "Finished doing dump rann.rann\_2
successfully". At this point, hit control-C until the shell prompt
comes back (it will take either 2 or 3 control-C's).
\par
Once the backup of each volume set finishes, find the processes (on
hodge) running "@sys/butc" and "@sys/backup -cell sipb.mit.edu" and
kill them manually. There will be files in /usr/tmp whose names are
backup and butc followed by the date. If the backup failed, they
should be deleted. If it succeeded, they should be compressed and
stored away in project.newdump, e.g.,
\begin{verbatim}
   cd /afs/sipb/project/newdump/last
   compress -c /usr/tmp/backup.960108 > backup.rann.960108.Z
   compress -c /usr/tmp/butc.960108 > butc.rann.960108.Z
\end{verbatim}
The most common failure mode is:
\begin{verbatim}
   file_tm: code = -1, errno = 5
   file_tm: I/O error Writing data
   Error in receiving data
   butm: tape I/O error
   Dump volume-set-name.dump-level-name encountered an error
\end{verbatim}
In this case, the tape should be marked "bad", a new tape should be
inserted, and backup.pl should be run again with the same arguments.
In general, it's necessary to re-run backup.pl whenever a "Dump ...
encountered an error" message is obtained.
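That rule is easy to check mechanically against a saved output log
(a sketch; dump\_needs\_rerun is a hypothetical helper name):

```shell
# Succeed (exit 0) if the saved backup.pl output contains the failure
# message quoted above, i.e. if a re-run with a fresh tape is needed.
dump_needs_rerun() {
    grep -q 'encountered an error' "$1"
}
```

Note that this only detects the failure; marking the tape bad and
swapping in a new one is still a manual step.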
\par
Write an entry on the tape package listing the date of the successful
backup, the dump-level name, and username, e.g.,
\begin{verbatim}
   01-08-96 /rann_2 paco
\end{verbatim}



\item{After the backup} of all the volume sets finishes, the AFS copy of
the backup database should be updated. This is for safety, in case the
local disk on hodge dies:
\begin{verbatim}
   rm -rf /afs/sipb/project/newdump/hodge/*
   cd /usr/afs/backup; tar cf - . | (cd /afs/sipb/project/newdump/hodge; tar xpf -)
\end{verbatim}
This would usually be the time to unlog.
\end{itemize}

