% $Source: /mit/olcdev/doc/ops/RCS/operation.tex,v $
%
% Generated by filling in /afs/dev.mit.edu/user/cfields/servers/template.tex.
% To document a service other than OLxx, please start from the template
% and read the comments there.
%
% PLEASE REMEMBER TO CHANGE THIS DOCUMENT TO MATCH REALITY.

% This is a LaTeX 2e document!
\documentclass[twoside,11pt]{article}

\usepackage[outline,45deg,helvetica]{draftdoc}
\drafttext{REALLY FINALEST DRAFT}

% define ``\rcsdate'' from the RCS date for this document
\def\datehack$#1 #2/#3/#4 #5${\ifcase #3\or January\or February\or March\or
	April\or May\or June\or July\or August\or September\or October\or
	November\or December\fi\space
    \number#4, \number#2}
\edef\rcsdate{\datehack $Date: 1999/01/14 06:43:17 $}

\def\OLxx{OL{\it xx}}

\title{OLC and Other {\OLxx} Services\footnote{The source for this
	document is in \texttt{/mit/olcdev/doc/ops/operation.tex}}}
\author{Albert Dvornik}
\date{last RCS change: \rcsdate}

% Default margins are too wide.
\addtolength{\oddsidemargin}{-.5in}
\addtolength{\evensidemargin}{-.5in}
\addtolength{\textwidth}{1in}

\begin{document}

\maketitle
\tableofcontents
\newpage

\setcounter{section}{-1}
\section{About This Document}

As of the most recent revision, this document now covers the setup and
operation of all the \OLxx\ (OLC, OLTA, OWL) services.  However, porting
and development work tends to use OLC as a model.  Therefore, this
document will tend to become somewhat OLC-centric, and may be less than
fully accurate when it comes to other \OLxx\ services (OLTA and OWL).
Please exercise extra caution when trying to upgrade or tweak the
OLTA/OWL server, and update this document with any insights you gain
into their operation.  (In general, whenever you think of a way to
improve this document, do it.)

\section{Functional Overview}
%\begin{quote}
%{\it This section contains information on the basics of what the
%server does and how it works. It is intended to both inform people of
%what it is they are working with and possibly to aid them in
%troubleshooting in circumstances that other areas of this document
%have failed to foresee.}
%\end{quote}

The software that may run on the OLC server can be divided into several
major components:

\subsection{olcd}

The OLC daemon, \texttt{olcd}, is the central part of the OLC system.
It talks to the user and consulting client binaries and performs
almost all operations associated with the operation of the OLC system.

Currently, a separate \texttt{olcd} binary must be compiled for each
of the \OLxx\ services, but the changes are minimal and restricted to
a small number of header files.

The OLTA and OWL services are currently run on the same machine.  To
avoid namespace collision, the olcd binaries for these services are
named \texttt{oltad} and \texttt{owld} respectively.

\subsection{rpd}

The replay daemon, \texttt{rpd}, is used by the
\texttt{olist}/\texttt{oreplay}/\texttt{oshow} client and by
\texttt{xolc} to read some of the log information.  This functionality
is already offered by \texttt{olcd}, but since \texttt{rpd} doesn't
lock the queue, it offers better performance.  The \texttt{rpd}
functions are currently primarily used by the Emacs OLCR interface.

The idea for the implementation of \texttt{rpd} originated in a time
when hardware was rather expensive and running a complicated
networking server taxed the available resources.  It is unlikely that
this will ever be a problem again.  As a result, designs for the
future of OLC should consider the option of entirely scrapping the
support for \texttt{rpd}; of course, such a change must take into
account all of the OLC system, including the more ``exotic''
bits.\footnote{Such as, for example, Emacs OLCR, \texttt{xolc} and
MacOLX.}

For the OLTA/OWL server, the two rpd binaries are named \texttt{rpd.olta} and
\texttt{rpd.owl} to distinguish between them.

\subsection{polld}

\texttt{polld} is a program that runs periodically to determine which
of the users with questions in the queue are currently logged in, and
where.  This helps consultants to determine the priority of various
questions.

Despite its name, \texttt{polld} does not normally run as a daemon,
but is periodically invoked from \texttt{cron}.  The \texttt{-cycle}
command-line option turns on stand-alone operation, which may be buggy
(I haven't tested it).

For the OLTA/OWL server, the two polld binaries are named
\texttt{polld.olta} and \texttt{polld.owl} to distinguish between them.

\subsection{lumberjack}

The logs of resolved questions are archived by \texttt{lumberjack},
which is invoked by \texttt{olcd} whenever needed.  If
\texttt{lumberjack} can't be run, the logs will collect dust in
\texttt{/var/athena/olxx/donelogs} until the next successful invocation
of \texttt{lumberjack}.

For the OLTA/OWL services running on the same machine, two lumberjack
binaries are needed.  They are named \texttt{lumberjack.olta} and
\texttt{lumberjack.owl}.
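
The per-service binary names given in the subsections above can be
summarized in a small shell sketch.  The function name
\texttt{binaries\_for} is made up for this illustration, and the
assumption that the OLC binaries carry no suffix is read off the paths
used elsewhere in this document (such as \texttt{/usr/athena/etc/olcd}).
Treat it as a memory aid, not as part of the OLC distribution.

\begin{verbatim}
# Hypothetical helper: print the daemon and helper binary names for one
# service.  The names come from the subsections above; nothing here is
# shipped with OLC.
binaries_for() {
    case "$1" in
        olc)  echo "olcd rpd polld lumberjack" ;;
        olta) echo "oltad rpd.olta polld.olta lumberjack.olta" ;;
        owl)  echo "owld rpd.owl polld.owl lumberjack.owl" ;;
        *)    echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Usage (install directory assumed to be /usr/athena/etc):
#   for b in $(binaries_for olta); do
#       [ -x "/usr/athena/etc/$b" ] || echo "missing: $b"
#   done
\end{verbatim}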

\subsection{topic archives}

\texttt{lumberjack} stores the logs of answered questions in Discuss
meetings.  Each question topic has a separate meeting. For OLC and OLTA,
the short name\footnote{If you don't know what I'm talking about, read
some Discuss documentation, which will hopefully explain it.} consists
of the letter \texttt{o} followed by the topic name: for example,
\texttt{olatex} or \texttt{oother}.  For the OWL service, the short name
of each meeting consists of the string \texttt{or} followed by the topic
name. In addition, the meeting named \texttt{ooga} should exist to store
questions resolved as ``unanswered.''

Note that other Discuss meetings may exist on the \OLxx\ server, but
they shouldn't be directly associated with the operation of the \OLxx\
software.
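
The naming rule for topic meetings can be written down as a short shell
sketch.  The function name \texttt{meeting\_name} is hypothetical; the
prefixes are the ones stated above.  The \texttt{ooga} meeting is a
fixed name and is not derived from any topic.

\begin{verbatim}
# Hypothetical helper: derive the Discuss short name for a topic meeting.
#   OLC and OLTA: "o"  followed by the topic name
#   OWL:          "or" followed by the topic name
meeting_name() {
    case "$1" in
        olc|olta) printf 'o%s\n' "$2" ;;
        owl)      printf 'or%s\n' "$2" ;;
        *)        echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   meeting_name olc latex     prints "olatex"
#   meeting_name owl latex     prints "orlatex"
\end{verbatim}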

\subsection{mail feeds}

OLC currently supports asking questions by e-mail.  (To the best of my
knowledge, OLTA and OWL do not.)  Athena mailhubs alias
\texttt{olc-}\textit{topic} to
\texttt{olc-}\textit{topic}\texttt{@matisse.local}, and Matisse's
sendmail daemon is configured to turn each such piece of mail into a
question with appropriate \textit{topic}.  The actual work is performed
by a program named \texttt{olcm}, which is invoked from the
\texttt{aliases} file similarly to \texttt{dsmail}:

\noindent\texttt{\null\hskip1cm
olc-other@matisse.LOCAL: "|/usr/athena/etc/olcm -s matisse -t other
\\ \null\hskip1cm
-k olc.matisse"@matisse.LOCAL}

Operation of the mail feed system will not be further covered in this
document.

\subsection{compiling statistics}

The OLC server currently calculates weekly statistics for answering
questions. Neither OLTA nor OWL calculates weekly statistics, although
applying the OLC scripts with appropriate modifications would not be
difficult.  The statistics generation uses a set of Perl scripts that
are not strictly a part of the OLC system.  The scripts are periodically
run from \texttt{cron}.

Statistics generation will not be further covered in this document.

\subsection{Rebecca}
\label{rebecca}

The OLC server currently runs a daemon named
Rebecca,\footnote{``Rebecca'' is short for ``Rebecca P. Davis'', which
is long for RPD.} which listens to several common Zephyr classes and
instances.  Whenever people complain about \texttt{olcd} or
\texttt{rpd}, Rebecca checks if the daemon is running and restarts it if
needed.  OLTA and OWL do not run this daemon.

Rebecca will not be further covered in this document.

\subsection{other gunk}

The OLC server will, without a doubt, also be running a lot of what
can best be described as ``other gunk''.  Most of it will be wholly
useless, such as an unrelated service installed five or ten years ago
by some forgotten maintainer for a completely incomprehensible reason.
Migration is a fine time to nuke this from the proverbial orbit.

\section{Server Characteristics}
%\begin{quote}
%{\it This section serves to supply information needed for determining
%where a service should be placed, on the basis of compatibility,
%security, and loading issues.}
%\end{quote}

\subsection{Compatibility}
%\begin{quote}
%{\it Will this service coexist peacefully, in a functional sense,
%with other services? With itself? (e.g., multiple copies of the same
%license server software running on different ports)}
%\end{quote}

There is currently no conceptual reason why two fully functional
{\OLxx} services couldn't be run on the same host.  In fact, OLTA and
OWL do just that.  Until recently, the port used by \texttt{rpd} was
hard coded into the binary, so an rpd was only run for OLTA.  Now,
however, \texttt{rpd} is run for both OLTA and OWL, with the port to be
used determined by Hesiod service entries for \texttt{olta-query} and
\texttt{owl-query}.

The {\OLxx} services can be expected to peacefully coexist with any
well-behaved non-{\OLxx} service, but are traditionally kept on a
separate server for other reasons.

%\texttt{rpd} looks up the Hesiod service entry for {\tt {\em
%$<$service$>$}-query}, rather than {\tt olc-query}.  Thus, {\tt
%olta-query} and {\tt owl-query} entries could be created to
%disambiguate between the relevant ports on the same host.

\subsection{Security}
%\begin{quote}
%{\it Is this service ``secure?'' Is its data sensitive? This
%information may be used to decide what other services may be offered
%by the same machine on the basis of whether the services have
%conflicting security interests.}
%\end{quote}

\subsubsection{Potential Targets}

The \OLxx\ servers store active user questions and Discuss logs of
old questions.  Both should be considered private, although not
critically confidential.
Users are normally prevented from making unauthorized requests (both
to access this data and to change the server state) through use of
access control and Kerberos authentication.

There is no other known sensitive data related to the \OLxx\ servers.

\subsubsection{Vulnerabilities}

The network connections between the OLC clients and the server are
currently not encrypted, and are consequently vulnerable to passive
network attacks.

The server still uses some static-sized buffers, and a user with
access to Kerberos tickets in the Athena realm might conceivably be
able to use a custom-designed client to overflow an unprotected
buffer.
% However, we'd know it was jhawk. >=)
A denial of service could certainly result from this; at worst, the
attacker may be able to gain super-user access to the server.
This problem could be remedied by rewriting all of the source to use
dynamic allocation, but this would be a rather large project for
relatively little immediate gain.

\subsection{Resource Loading}
%\begin{quote}
%{\it Is this a high or low load service? That is, does it need a
%server of its very own, or can it be put on a server with other
%services?}
%\end{quote}
%\begin{quote}
%{\it Similar to CPU loading, in case we put into production any more
%services that use a lot of network bandwidth.}
%\end{quote}

While CPU loading may have been a concern once upon a time, it is not
expected to be an issue on a modern architecture (such as a Sparc 5).
Network load from the \OLxx\ services is negligible.

\subsection{Policy}

Traditionally, root login privileges on the OLC server have been granted to
the full-time consulting staff and some student consultants.  This
constitutes a further reason against using the machine(s) in question to run
other operational services.

\section{Requirements and Dependencies}

%\begin{quote}
%{\it Everything that the service and server depend on should be
%documented here. Additionally, notable things that the server does not
%depend on should also be mentioned. This second requirement exists in
%case dependencies are forgotten; the lack of mention of a dependency
%might be taken to mean that the dependency does not exist, rather than
%the dependency was overlooked. Thus, explicitly mentioning
%nondependencies can make it clear when was something was overlooked,
%rather than causing confusion in troubleshooting by implying something
%false. Examples that should generally be filled out appropriately
%follow below.}
%\end{quote}

\subsection{Required}
\begin{itemize}
\item {\it Kerberos} --- The client and the server require Kerberos.
\item {\it srvtab} --- Requires an {\tt olc} srvtab entry
\item {\it root access} --- Currently needs to run as root. [change?]
\item {\it Athena} --- Current installation requires either Athena 8.1
or 8.2.
\item {\it Platforms} --- Runs under Sun/Solaris 2.5.1 or Solaris 2.6.
  A small amount of porting may be required to make it work on other
  POSIX and XPG3 compliant platforms.
\item {\it Hesiod} --- To be accessed via the UNIX {\tt olc/olcr} client,
{\tt ol{\it xx} sloc} and {\tt ol{\it xx} service} should exist in Hesiod.
(Alternatively, the fallback {\tt sloc} information can be retrieved from
the client configuration file, and the {\tt service} information from the
Athena-wide {\tt /etc/services} file.)

To be accessed via the UNIX {\tt oreplay/olist/oshow} client, {\tt ol{\em
xx} sloc} and {\tt ol{\em xx}-query service} should exist in Hesiod.  (The
fallbacks are as above.)

In addition to those already mentioned, access via the Macintosh client is
believed to depend on the {\tt ol{\it xx}-locking service} entry.
\item {\it name service} --- Clients use name service in connecting to
the server.

\item {\it syslog} --- Uses syslogging.  (A running \texttt{syslogd}
is not strictly required for operation, but in the Athena environment
the data provided is extremely useful for diagnosing and correcting
problems.)

\item {\it zhm} --- OLC uses Zephyr to notify users and consultants of
various events.
\end{itemize}
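
The Hesiod entries listed above can be checked from a shell before a
server is put into service.  The following is a minimal sketch, assuming
that the standard \texttt{hesinfo} tool is installed and that it exits
with a non-zero status when a lookup fails.  The function names are
invented for this example.

\begin{verbatim}
# Hypothetical helper: print the Hesiod (name, type) pairs that the UNIX
# clients of one service look up, per the Hesiod item above.
hesiod_queries() {
    echo "$1 sloc"            # used by all UNIX clients
    echo "$1 service"         # used by olc/olcr
    echo "$1-query service"   # used by oreplay/olist/oshow
}

# Assumes hesinfo is available and fails with a non-zero exit status.
check_hesiod() {
    hesiod_queries "$1" | while read name type; do
        hesinfo "$name" "$type" >/dev/null 2>&1 ||
            echo "missing: $name $type"
    done
}

# Usage: check_hesiod olta
\end{verbatim}

An entry reported as missing is not necessarily fatal, since the
clients can fall back to their configuration file and
\texttt{/etc/services} as described above.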

\subsection{Required for Interoperability with Earlier Versions}

\begin{itemize}
\item {\it Hesiod} --- Old UNIX clients (Release 8.0 and earlier) used the
{\tt ols service} Hesiod entry (instead of {\tt olc-query}).  When those
clients are fully desupported, and after a final {\tt grep} through the
source tree, this entry can be removed.
\item {\it moon} --- The machine must take an accidental update when the
Lunar cycles say so.
\end{itemize}

\subsection{Not Required}
\begin{itemize}

\item {\it disk space} --- The queue takes up relatively little space
($<$100MB).  On the other hand, the Discuss archives have the ability to
grow indefinitely.  The disks should not be allowed to fill during
operation.
\item {\it swap space/real memory} --- Has no notable memory requirements,
except that the swap space should not be allowed to fill during operation.
\item {\it peripherals} --- Requires no special peripherals; an
external disk may be used to perform regular local backups.
\item {\it /srvd} --- RVD need not be attached.
\item {\it ethernet} --- Does not depend on ethernet address.
\end{itemize}

\section{Dependents}
% \begin{quote}
% {\it This section lists all of the client software, other services,
% and users who depend on the correct functioning of this service. In
% some cases, such as Kerberos, it may not make sense to list all of the
% dependents. This section should serve to give an idea of the impacts
% when the service is lost and what priority should be associated with
% restoring the service.}
% \end{quote}

The direct dependents of the OLC system are the On-Line Consultants.
However, the entire Athena user community uses the OLC service to ask
(and answer) questions.  Finally, since problems are more likely to
get noticed when reported to the consultants, and since noticed
problems are easier to correct in a timely fashion, successful
operation of OLC helps Athena Operations in their eternal vigilance.

The direct dependents of the OLTA system are the students of the classes
using the OLTA system and the TAs and professors answering those
questions.  At the time of this writing, 10.001, 6.170, and 10.301 are
the classes most often using OLTA.

The direct dependents of the OWL system are the Librarians of the MIT
Libraries, and the students and staff members using the OWL service to
seek help with their research.

\section{Impacts of Service Failures}
% \begin{quote}
% {\it This section should enumerate the possible impacts of service
% failures on the basis of failure type. This section is referred to
% later for easy identification of failure modes that may be caused by
% various maintenance operations.}
% \end{quote}

\begin{enumerate}
\item
If \texttt{olcd} is not operating properly, users will be unable to
ask questions and consultants will be unable to answer them.

If the daemon is not running, it can get restarted by Rebecca
(sec.~\ref{rebecca}).

\item
If \texttt{rpd} is not operating properly, consultants will be unable
to use the Emacs OLCR client.  However, they can still use the
prompt-based client, and users will still be able to ask questions.
Most consultants normally use EOLCR, so this will decrease their
efficiency at answering questions.

If the daemon is not running, it can get restarted by Rebecca
(sec.~\ref{rebecca}).

\item
If \texttt{polld} is not operating properly, the data on who is
currently logged in will be out of date or incorrect.

\item
If \texttt{lumberjack} is not operating properly, the resolved logs
will not be placed in archives, but will accumulate in the
\texttt{donelogs} subdirectory.  Note that this is sometimes desirable
behavior, such as during migrations; you may wish to make the
\texttt{lumberjack} binary unavailable in those situations by
renaming it.

\item
If the Discuss daemons aren't operating properly, consultants will be
unable to read logs of past questions and extract the wisdom of the
ages from them.

\item
If the local sendmail daemon isn't running correctly, users will be
unable to ask questions by e-mail, and consultants will be unable to
correct previous answers by mail.  If the mailhubs aren't operating
properly, in addition to the above, consultants will be unable to use
e-mail to communicate to logged-out users.

\item
If Rebecca isn't operating properly, consultants without root login
privileges on the server won't be able to restart \texttt{olcd} and
\texttt{rpd}.  Furthermore, we'll be deprived of her wit and humor.

\item
If \texttt{zhm} is not running, \texttt{olcd} will fall back to a
networked version of \texttt{write}.
\end{enumerate}

There are doubtless more failure modes that have not been mentioned
here, but one should hope their consequences are less significant than
the ones noted above.

\section{Building}
% \begin{quote}
% {\it This section gives the location of the development source code
% and instructions on copying, building and potentially testing it. If a
% special build environment is required, such as a particular compiler
% or Athena software release, this should be mentioned. If the code does
% not build cleanly, this should also be mentioned. Additionally, any
% read-access restrictions that might exist for the sources should be
% mentioned here.}
% \end{quote}

First and foremost: if you ever build OLC on Athena, you have a moral
obligation to ensure that any changes you make get incorporated back
into a central source tree, which is\footnote{Well, it will soon be,
as of release 8.3.  Someone get rid of this footnote as soon as it's
actually true.} currently a part of the Athena release.

[NOTE: these instructions aren't intended to work with the source tree
from the release yet.  That should be fixed once the release has a
useful source tree.]

The simplest way to build the OLC source tree is to find the
corresponding copy of the \texttt{ramirez} script and invoke it as
\texttt{./ramirez -X -Wall -DLOG}.  (\texttt{-DLOG} adds debugging
syslogs for each type of request.)  For building OLTA/OWL binaries, add
\texttt{-DOLTA} or \texttt{-DOWL} to the flags given, as appropriate.
Alternatively, try the following:

\begin{itemize}
\item Create a symlink farm into the source tree.  Change your working
directory to the top level of the link farm.  (You could just build in
the source tree, but someone will hate you for it, I bet.)
\item Run \texttt{imake -Iconfig -DUseWclFromLocker}.
\item Change to \texttt{./server} and run \texttt{make all}.
\item Copy the server binaries somewhere useful.
\end{itemize}

\section{Installation}
\label{installation}
% \begin{quote}
% {\it Here instructions for installing the software on a server are
% given. This should include things like getting a srvtab entry, placing
% lines in inetd.conf, editing rc.local or its equivalent, etc. It
% should also provide a description of the server's directory hierarchy
% and the files within it. Most of the installation work should be done
% by mkserv, but this section should describe that work for purposes of
% troubleshooting, etc.}
% \end{quote}

The following steps should be followed to install a new server:

Note that all instances of ``olxx'' below should be replaced with the
name of the appropriate service (olc, olta, or owl).

\begin{itemize}
\item If creating an OLC server, run \texttt{mkserv ops
olc}.\footnote{Actually, currently you need to run \texttt{mkserv ops
olcdev:/mit/olcdev/mkserv olc}, until the OLC mkserv scripts are
installed in the mkserv locker.}  (This will list many of the
instructions that follow.)

Note that \texttt{mkserv ops} will automatically invoke
\texttt{remote}, while \texttt{olc} will automatically invoke
\texttt{mail} and \texttt{discuss}.

\item If creating an OLTA/OWL server, run only \texttt{mkserv ops mail
discuss}. \texttt{mkserv olc} is still OLC specific, so you'll need to
use the separate mkserv scripts and do some things by hand.

The most important thing will be to build the appropriate binaries.
OLTA/OWL binaries can be built using the build system described in this
document, using \texttt{-DOLTA} or \texttt{-DOWL} as appropriate.
\item Install an \texttt{olc.hostname} srvtab in /etc/athena/ol{\it xx}/srvtab.
\item Create a list of topics in /etc/athena/ol{\it xx}/topics.
\item Create the \texttt{.acl} files under /etc/athena/ol{\it xx}/specialties/.
\item Create and populate the \texttt{.acl} files under
/etc/athena/ol{\it xx}/acls/.
\item Examine other files in /etc/athena/ol{\it xx}, and edit if needed.
\item Set up Discuss archives.
\item Set up olcm mail feeds.

\item Make sure all of the relevant \texttt{cron} jobs are installed.
Currently:
\begin{verbatim}
# polld locates users and updates their OLC login status (by talking to olcd)
1,11,21,31,41,51 * * * *        /usr/athena/etc/polld
# Incrementals to /backup/inc
#  Note: medusa runs remote UFS backups via the network.  Ideally, these
#    happen ~3 times a week.  The following line makes a daily incremental
#    dump (relative to the last medusa dump) and stores it on an external
#    drive.  This should happen daily.  --bert 12feb1998
55 17 * * *     /var/ops/bin/backup_inc 7
# Copy OLC and Discuss acls from Moira lists.
19 7,22 * * *   /u1/moira-sync/syncer.pl >> /u1/moira-sync/log.stdout
# The following jobs deal with generating OLC statistics reports
0 0 * * *       /var/ops/bin/gen_olc_stats.pl -a /var/athena/olc/stats/ask_stats\
	 -r /var/athena/olc/stats/res_stats
0 10 * * 1      /var/ops/bin/olc-report.pl -w; /var/ops/bin/olc-report.pl -w -c
\end{verbatim}

Again, this is OLC specific.  For the OLTA/OWL server, only the polld
cron job is required.  However, both copies of polld
(\texttt{polld.olta} and \texttt{polld.owl}) will need to be run from cron.
\item Make sure to install all of the scripts that \texttt{cron} will
try to run.  Many of these are located in
/mit/ops/services/olc/scripts.

\item For the OLC server, set up Rebecca.  She lives in
/mit/ops/services/olc/rebecca, and /var/ops/bin on the server.  Her init
script should be linked to /etc/rc2.d/S95rebecca.
\end{itemize}
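
For the OLTA/OWL server, the \texttt{polld} requirement in the list
above might translate into crontab entries like the following.  The
install directory and the schedule are assumptions carried over from the
OLC entry; adjust both to match the actual host.

\begin{verbatim}
# polld for OLTA and OWL (path and schedule assumed, mirroring the OLC entry)
1,11,21,31,41,51 * * * *        /usr/athena/etc/polld.olta
1,11,21,31,41,51 * * * *        /usr/athena/etc/polld.owl
\end{verbatim}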

\section{Maintenance}
% \begin{quote}
% {\it This section addresses all of the required maintenance aspects of
% the server. Each subsection, as appropriate, should include
% information on the form of service impact (as categorized in ``Impacts
% of Service Failures''), or lack thereof, caused by the operation.}
% \end{quote}

%\subsection{Periodic Maintenance}
% \begin{quote}
% {\it Any maintenance that periodically needs to be done should be
% described here, whether done by hand or automated. This may include
% removal or archiving of old log files, as well as turning logs over.}
% \end{quote}

\subsection{Relocation}
% \begin{quote}
% {\it This section explains the steps required in moving a service
% from one server machine to another. It may mention what parts of
% configuration files need to be updated and what Hesiod information
% needs to be changed. It may also need to include information on
% converting data files to reflect machine changes (due to hostname
% change or byte ordering, etc.). }
% \end{quote}

The following steps should be followed to migrate a server.  It is
assumed that the new server will, at the end of migration, have the
old server's hostname and IP address.

\begin{itemize}
\item[] On the new server:
\item Run \texttt{mkserv ops olc} or \texttt{mkserv ops mail discuss} as
appropriate.\footnote{Once again, currently \texttt{mkserv
olcdev:/mit/olcdev/mkserv olc}.}
\item Copy the contents of /etc/athena/ol{\it xx}.  (This needs to be
encrypted, since it contains the srvtab!)
\item Set up olcm mail feeds.
\item Copy the \texttt{/.klogin}.

\item Set up other related stuff (Rebecca, stats, backups\ldots) as
described in sec.~\ref{installation}.

\item[] Shortly in advance of the move, on the old server:
\item Change the \OLxx\ motd to announce the upcoming move.
\item Disable \texttt{lumberjack} by renaming the binary in /usr/athena/etc.
\item Migrate the Discuss archives.

\item[] This is a good point to consider turning back. =)

\item[] To start the move, on the old server:
\item Restart \texttt{olcd} in the maintenance mode, by killing the
old daemon and restarting it with the \texttt{-maint} option.
\item Change the \OLxx\ motd to indicate that a move is in progress.
\item Copy all queue data (from /var/athena/ol{\it xx}, or on old servers,
from /usr/spool/ol{\it xx}) to the new server.

\item Bring down the old server.
\item Restart the new server with the old server's hostname and IP address.
\item Make sure everything is running as expected.

\item Optionally, change the \OLxx\ motd to indicate that the move succeeded.
\item Optionally, bring up the old server on a different IP address.
\end{itemize}

If the move fails in some non-salvageable way, the old server should
be brought back up on its old address and \texttt{lumberjack}
re-enabled, until the problem can be fixed.
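
Disabling and re-enabling \texttt{lumberjack} by renaming could look
like the sketch below.  The function names and the \texttt{.disabled}
suffix are arbitrary choices made for this example; only the directory
comes from the steps above.  For OLTA or OWL, pass the suffixed binary
name (for example \texttt{lumberjack.olta}) as the argument.

\begin{verbatim}
# Hypothetical helpers; the ".disabled" suffix is an arbitrary choice.
# BINDIR defaults to the directory named in the relocation steps.
BINDIR=${BINDIR:-/usr/athena/etc}

disable_lumberjack() {
    name=${1:-lumberjack}
    [ -f "$BINDIR/$name" ] &&
        mv "$BINDIR/$name" "$BINDIR/$name.disabled"
}

enable_lumberjack() {
    name=${1:-lumberjack}
    [ -f "$BINDIR/$name.disabled" ] &&
        mv "$BINDIR/$name.disabled" "$BINDIR/$name"
}
\end{verbatim}

While the binary is renamed, resolved logs simply accumulate in the
\texttt{donelogs} directory and are archived on the next successful
invocation.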

\subsection{Reconfiguration}
% \begin{quote}
% {\it This section includes information on how to make server
% configuration changes. It should mention what service impact, if any,
% should be expected when changing things, how to make the server
% recognize changes, and what to do in case of failure.}
% \end{quote}

Most configuration changes that are expected under normal operation of
\OLxx\ can be made using the administrative functions of
\texttt{olcr}.  Please avoid making those changes by hand on the
server.

[Make a list of things that can't?]

\subsection{Restarts}
% \begin{quote}
% {\it This section explains how to restart the server, mentioning the
% same points as above in the reconfiguration section.}
% \end{quote}

The simplest way of restarting \texttt{olcd} and \texttt{rpd} on the
OLC server is by using Rebecca.  (For more information, try
\texttt{zwrite olc.matisse -m "Rebecca, help"}.)

Otherwise, \texttt{olcd} and \texttt{rpd} will shut down cleanly when
killed with SIGHUP, SIGINT or SIGTERM, and can be restarted by simply
running /usr/athena/etc/olcd and /usr/athena/etc/rpd, respectively.
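
A manual restart along these lines might be scripted as follows.  This
is a sketch under stated assumptions, not a tested procedure: the
function names are invented, the two-second pause is arbitrary, and
\texttt{ps -e} is assumed to print the PID in the first column and the
command name in the last.

\begin{verbatim}
# Filter "ps -e" output (PID TTY TIME CMD) down to the PIDs of one command.
# Uses "awk -v", so this assumes a POSIX awk (for example nawk on Solaris).
pids_named() {
    awk -v name="$1" 'NR > 1 && $NF == name { print $1 }'
}

# Hypothetical wrapper: stop a daemon with SIGTERM, then start it again.
# Any extra arguments are passed to the new daemon.
restart_daemon() {
    name=$1; shift
    for pid in $(ps -e | pids_named "$name"); do
        kill -TERM "$pid"
    done
    sleep 2    # arbitrary grace period for a clean shutdown
    "/usr/athena/etc/$name" "$@"
}

# Usage: restart_daemon olcd
#        restart_daemon rpd
\end{verbatim}

Since extra arguments are passed through, the same wrapper could bring
\texttt{olcd} back up in maintenance mode with
\texttt{restart\_daemon olcd -maint}.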

\subsection{Updates}
% \begin{quote}
% {\it This section contains similar information to the reconfiguration
% section, except in the context of putting a new version of server
% software into place. It additionally documents any special procedures
% to follow when doing this. For example, if the service is redundantly
% offered, it may be desirable to update the primary server only to
% begin with (if that is the architecture), so that it will be tested,
% and in case of failure fallback to backup servers will occur. Or, it
% may be the case that all servers must be updated simultaneously.}
% \end{quote}

If the \OLxx\ server software is updated, any number of things
described in this document could change.  Therefore, whoever is
updating a server should make sure to update this document as needed.

Normally, updates will be tied to Athena version updates.  Therefore,
unless the data formats change, you can just use normal procedures for
Athena version upgrades.  However, I would heartily recommend trying
the upgrade (with live data) on a different machine first.

\subsection{Shutdowns}
% \begin{quote}
% {\it This section provides an explanation of how to shutdown a server
% in an orderly fashion. It also gives information, depending on the
% expected duration of the shutdown, what agencies or users to inform of
% the shutdown and how to do so.}
% \end{quote}

\texttt{olcd} and \texttt{rpd} will shut down cleanly when killed with
SIGHUP, SIGINT or SIGTERM.  However, it is often preferable to leave
\texttt{rpd} running and restart \texttt{olcd} in maintenance mode
(with the \texttt{-maint} command-line argument), which will enable
the users to see the MOTD explaining the problem.

\subsection{Removal from Service}
% \begin{quote}
% {\it This section should outline the steps that need to be taken when
% a server is removed from service forever, including listing who needs
% to be informed and what kind of a warning timeframe should exist.}
% \end{quote}

If any of the \OLxx\ services is to be taken out of service, the
Athena user community should be notified and given time to complain
far in advance.

If there is to be an Irish wake, make sure to invite the full-time
consultants and all past OLC developers (myself included).  Please be
aware that letting MIT know about any Irish wakes for servers may be a
violation of MIT's alcohol policies.  Don't tell them when OLC code
drives you to drink, either.  [It will. =)]

\subsection{Backups and Restores}
% \begin{quote}
% {\it This section explains the nature of doing server backups and
% restores. Sometimes special tweaking may be necessary before a backup
% or after a restore for best results, and that should be explained
% here.}
% \end{quote}

The OLC server currently backs up its disk daily at dump level 7, to
supplement the regular level 0 network backups from Medusa.  This uses
standard Ops scripts for the purpose.  The OLTA/OWL server is backed up
to Medusa with the same schedule, but does not perform daily backups to
an alternate local disk.

\section{Troubleshooting}

\subsection{Diagnostics}
% \begin{quote}
% {\it This section should describe diagnostic procedures and methods
% available for tracking down and fixing problems. These may include
% interpreting log files, acquiring core dumps, and using special
% software provided to exercise the service. It should also include
% information on things not to touch or do as appropriate.}
% \end{quote}

The bad news regarding diagnostics is that the best available
diagnostic tool is currently the source code.
The good news regarding diagnostics is that the best available
diagnostic tool is currently the source code.

Many server messages are syslogged using the \texttt{local6} facility.
Those messages will by default be logged to
/var/athena/ol{\it xx}/admin/syslog, and should provide insight into most
service failures (we hope).  You may wish to also forward those
messages elsewhere to keep an eye on the state of the server.

Coredumps will be found in /var/athena/ol{\it xx}/core.\textit{date}.

\subsection{Known/Predicted Failure Modes}
% \begin{quote}
% {\it Any known or potential failure modes should be mentioned here,
% their causes and consequences. Some examples are included here.}
% \end{quote}

\begin{itemize}

\item \textit{olcd died} --- The command-line clients will display the message
\\ \texttt{Unable to connect to the OLC daemon.  Please try again later.} \\
Other clients will display similar errors.

\item \textit{rpd died} --- The EOLCR list of questions will fail to
be displayed.  \texttt{olist}, \texttt{oshow} and \texttt{oreplay}
will show messages like
\\ \texttt{Thu 05-Mar-98  8:09pm connect: Connection refused}

\item \textit{other errors} --- It is unclear how gracefully the
system will deal with most other errors, especially those that may
occur in a large number of places (e.g., filling disk or swap).  Many
errors may result in the daemon dying, with the results described
above.

\end{itemize}
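
As a first triage step, the two client messages quoted above can be
mapped back to the daemon that most likely died.  The helper below is a
hypothetical sketch that knows only those two messages; the other
clients are said to display similar but not identical errors, so
anything else is reported as unknown.

\begin{verbatim}
# Hypothetical triage helper: map a client error message to the daemon
# to check first, using the two messages quoted in the list above.
suspect_daemon() {
    case "$1" in
        *"Unable to connect to the OLC daemon"*) echo olcd ;;
        *"connect: Connection refused"*)         echo rpd ;;
        *)                                       echo unknown ;;
    esac
}

# Usage: suspect_daemon "connect: Connection refused"
\end{verbatim}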

\subsection{Service}
%\begin{quote}
%{\it Where should common bugs be sent? Critical bugs? Is there a
%developer to be paged for emergencies? Someone else at MIT? A vendor
%to contact?}
%\end{quote}

Bug reports should be sent to {\tt $<$bug-olc@mit.edu$>$} for posterity.
Prompt responses should not be expected; outside of periods of testing,
there is no one responsible for fixing critical problems in a timely
fashion.

\section{State Information}
% \begin{quote}
% {\it A basic description of the state the server keeps should be
% included here. This includes logs, database files, etc. This section
% should also include information on who, if anyone, has an interest
% in any of this state.}
% \end{quote}

All current server configuration state is stored under
/etc/athena/ol{\it xx}/.  All current queue and log data is stored under
/var/athena/ol{\it xx}/.  All past OLC/OLTA question data is stored in Discuss
meetings whose names start with \texttt{o}.  All past OWL question data
is stored in Discuss meetings whose names start with \texttt{or}.
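
A quick way to see whether the on-disk state for a service is in place
is to test for the paths named in this document.  The sketch below is
illustrative only: the function name and the two root variables are
made up, and the list of paths is limited to those mentioned in the
Installation, Functional Overview and Diagnostics sections.

\begin{verbatim}
# Hypothetical check: print each expected path that is absent for a
# service.  ETC_ROOT and VAR_ROOT default to the locations named above.
ETC_ROOT=${ETC_ROOT:-/etc/athena}
VAR_ROOT=${VAR_ROOT:-/var/athena}

missing_state() {
    svc=$1
    for p in "$ETC_ROOT/$svc/srvtab" "$ETC_ROOT/$svc/topics" \
             "$ETC_ROOT/$svc/specialties" "$ETC_ROOT/$svc/acls" \
             "$VAR_ROOT/$svc/donelogs" "$VAR_ROOT/$svc/admin/syslog"; do
        [ -e "$p" ] || echo "$p"
    done
}

# Usage: missing_state olc
\end{verbatim}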

\end{document}




