[0727] daemon@ATHENA.MIT.EDU (John Hawkinson)  SIPB issues discussion list  02/06/03 01:19 (89 lines)
Subject: jhawk project: po server monitoring
From jhawk@MIT.EDU Thu Feb 06 06:19:57 2003
Return-Path: <jhawk@MIT.EDU>
Delivered-To: sipb-discussion-mtg@CHARON.mit.edu
Received: (qmail 10150 invoked from network); 6 Feb 2003 06:19:55 -0000
Received: from pacific-carrier-annex.mit.edu (18.7.21.83)
  by charon.mit.edu with SMTP; 6 Feb 2003 06:19:55 -0000
Received: from grand-central-station.mit.edu (GRAND-CENTRAL-STATION.MIT.EDU [18.7.21.82])
	by pacific-carrier-annex.mit.edu (8.9.2/8.9.2) with ESMTP id BAA05422;
	Thu, 6 Feb 2003 01:19:26 -0500 (EST)
Received: from melbourne-city-street.mit.edu (MELBOURNE-CITY-STREET.MIT.EDU [18.7.21.86])
	by grand-central-station.mit.edu (8.9.2/8.9.2) with ESMTP id BAA13226;
	Thu, 6 Feb 2003 01:16:55 -0500 (EST)
Received: from multics.mit.edu (MULTICS.MIT.EDU [18.187.1.73])
	by melbourne-city-street.mit.edu (8.9.2/8.9.2) with ESMTP id BAA08451;
	Thu, 6 Feb 2003 01:16:55 -0500 (EST)
Received: (from jhawk@localhost) by multics.mit.edu (8.9.3)
	id BAA14731; Thu, 6 Feb 2003 01:16:55 -0500 (EST)
Date: Thu, 6 Feb 2003 01:16:55 -0500 (EST)
Message-Id: <200302060616.BAA14731@multics.mit.edu>
To: sipb-discussion@MIT.EDU
Subject: jhawk project: po server monitoring
From: John Hawkinson <jhawk@MIT.EDU>


[ email enveloped to sipb-office; discussion on sipb-discussion,
please. This is labelled as a "jhawk-project" because of its style and
because I plan to implement it, but I think some background discussion
may be of use. ]

Wednesday night, po14 experienced a ~5+ hour outage. It lasted that long
mostly because nobody noticed it and bothered to report it to the network
group.

As part of our advocacy role, I think that SIPB should implement a monitoring
system for essential parts of the infrastructure that are not otherwise
monitored. (I've thought this for some time but never got around to
really getting off my ass).

To that end, I'd like to monitor the mail system.
Specifically, this monitoring should include:

.	Superficial wellness checks of the mailhubs, po servers, outgoing,
	etc. (icmp ping, tcp connect, etc.)

.	Periodic (perhaps less frequent than the above?) actual checks
	of sending mail through the mail system: sourcing data at
	the published inputs (outgoing and the mit.edu MXs) and measuring
	delay until receipt on the PO servers.

.	Periodic actual checks of those subsystems that lend themselves to
	such tests, e.g. a) smtp to the po servers and retrieve
	via imap; b) smtp through outgoing, destined for the SMTP server
	running on the test machine.

.	Reporting of the problem, presumably via zephyr and perhaps email,
	and probably not direct support for paging.

.	Statistics collection.

.	A possible future enhancement would be some sort of monitoring of
	webmail and pop, as well as imap.
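
The superficial checks in the list above could start from something
like the following. This is only a sketch, in Python purely for
illustration (it does not settle the language question), and
`tcp_check` is a hypothetical helper name, not existing code. It makes
a single TCP connect attempt and reports success plus elapsed time;
ports 25 (SMTP) and 143 (IMAP) would be the obvious targets.

```python
import socket
import time


def tcp_check(host, port, timeout=5.0):
    """Try one TCP connect; return (ok, elapsed_seconds).

    Hypothetical helper for illustration. A refused connection, a
    timeout, or a name lookup failure all count as "not ok".
    """
    start = time.monotonic()
    try:
        # create_connection resolves the name and connects, raising
        # OSError (or a subclass) on any failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```

A monitor would presumably call this on a timer for each server and
only report after a few consecutive failures, so that one dropped
packet does not generate a zephyr.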


I assume there is support for SIPB to sponsor such a project; it seems
quite clearly within our mandate of furthering [student] computing.

A special requirement this project has is IMAP mailboxes on each PO
server.  I think the best approach is to have SIPB sponsor 5
'imtestNN' accounts (for po9, po10, po11, po12, and po14) and check
delivery to those mailboxes.  Is there support for, or objection to, this
methodology? If the latter, is a better approach available?

(The MIT mail system no longer supports delivery to PO servers other
than "your own", and I think asking the network group to create a
special mailbox might result in special treatment that would defeat
the point of the monitoring system.)
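
To make the delivery check concrete, here is a rough sketch of how a
probe message might be tagged on the way in and recognised again in
the test mailbox. It is an illustration under assumptions, not a
design: the header names `X-Probe-Token` and `X-Probe-Sent`, the
function names, and the example.org addresses are all made up for the
example. Each probe carries a random token and its send time, and the
delay is computed when the same token turns up at the other end.

```python
import time
import uuid
from email import message_from_bytes
from email.message import EmailMessage
from email.utils import formatdate

# Hypothetical header names, chosen only for this sketch.
PROBE_HEADER = "X-Probe-Token"
SENT_HEADER = "X-Probe-Sent"


def build_probe(sender, recipient, now=None):
    """Return (token, message) for one uniquely tagged probe."""
    now = time.time() if now is None else now
    token = uuid.uuid4().hex
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "mail probe " + token
    msg["Date"] = formatdate(now, localtime=False)
    msg[PROBE_HEADER] = token
    msg[SENT_HEADER] = "%.3f" % now  # send time, seconds since epoch
    msg.set_content("Automated delivery probe; safe to delete.\n")
    return token, msg


def probe_delay(raw_bytes, token, received_at=None):
    """Return delivery delay in seconds, or None if not our probe."""
    received_at = time.time() if received_at is None else received_at
    msg = message_from_bytes(raw_bytes)
    if msg.get(PROBE_HEADER) != token:
        return None
    return received_at - float(msg[SENT_HEADER])
```

Actually submitting the probe and fetching it back would sit on either
side of this, for instance with the standard smtplib and imaplib
modules. The computed delay is only as good as the monitoring host's
clock, since both timestamps are taken there.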


I hesitate to admit it, but I suspect the best language to write this
monitoring software in is perl. I'd be curious if there are strong
opinions about this.

If anyone has knowledge of good prior art in the area of SMTP/IMAP
monitoring, I'd be pleased to hear about it.

Feedback appreciated. I expect this service would find a home on
charon (though perhaps it should have a dedicated IP address for the
SMTP server side of the monitoring? dunno).

--jhawk
--[0727]--


Further todo items:
	Confirm validity of timestamps by ntp polling.
	Confirm path normality/consistency.
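
For the ntp item, a minimal sketch of the arithmetic involved, again
in Python for illustration only. The function names are hypothetical.
It assumes the standard 48-byte NTP packet layout (receive timestamp
in words 8-9, transmit timestamp in words 10-11) and the usual
four-timestamp offset calculation; sending the actual UDP query to
port 123 is left out.

```python
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch.
NTP_EPOCH_OFFSET = 2208988800


def ntp_to_unix(seconds, fraction):
    """Convert a 32.32 fixed-point NTP timestamp to Unix seconds."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def parse_server_times(packet):
    """Return (receive, transmit) Unix times from an NTP reply."""
    words = struct.unpack("!12I", packet[:48])
    receive = ntp_to_unix(words[8], words[9])
    transmit = ntp_to_unix(words[10], words[11])
    return receive, transmit


def clock_offset(t0, t1, t2, t3):
    """Return (offset, round_trip_delay) in seconds.

    t0 = client send, t1 = server receive,
    t2 = server transmit, t3 = client receive.
    A positive offset means the local clock is behind the server.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay
```

If the offset comes back larger than some tolerance (a second or so,
say), the measured mail delays from that period should probably be
flagged as suspect rather than reported.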

