bogofilter-milter.pl — A Sendmail::PMilter Milter for bogofilter
Introduction
(Last updated 2024/02/02 20:08:25.)
Bogofilter is a mail filter
that classifies mail as spam or ham (non-spam) by a statistical
analysis of the message's header and content (body). The program is
able to learn from the user's classifications and corrections.
Bogofilter can be used in three different ways:
- it can be integrated into the user's email client
(e.g., Evolution);
- it can be integrated into the user's delivery agent
(e.g., procmail); or
- it can be integrated into the mail transfer agent, a.k.a. MTA
(e.g., sendmail, Postfix).
To make it easier to integrate bogofilter and other applications
into MTAs, several of them, including sendmail and Postfix, support
mail filters called Milters. That's where
bogofilter-milter.pl, a bogofilter Milter, comes into play. If you
would like to do bogofilter spam filtering in the MTA, so that spam
can be rejected before your mail server even tries to deliver it to a
specific user, then you've come to the right place.
If, on the other hand, you don't understand anything written above,
then you should probably go elsewhere. :-)
Installation and configuration
- Download the script here and
save it somewhere on your mail server without the ".txt"
extension.
- Run "perl -c" on it to confirm that you've got all the necessary
Perl modules installed. If not, install the module that Perl reports
is missing. Repeat until "perl -c" returns no errors.
- Search for "BEGIN CONFIGURABLE SETTINGS" in the script, carefully
read through everything up until "END CONFIGURABLE SETTINGS", and
modify the settings as appropriate for your site.
- Run "perl -c" on the script again to confirm that you didn't make
a syntax error while editing settings.
- Set it up to be started automatically when your server reboots.
For example, here's a script
for Fedora and similar systems,
e.g., RHEL and CentOS, and here's one (thanks to Tom Anderson)
for Gentoo.
- Start it up using the init script you just installed.
- Tell your MTA to start using it. See, for example, documentation
for sendmail
and Postfix.
- Test, test, test! It's pretty much "set and forget" once it's
working, but make sure it's working before you set and forget it!
Training
Bogofilter learns from each new message it sees. That is, when it
sees a message that it thinks is spam, it learns that messages that
look a lot like that one should also be considered spam. This is the
essence of how bogofilter works.
When bogofilter makes a mistake (i.e., decides that
something is spam that actually isn't, or vice versa), it
needs to be told that it made a mistake, so that it won't
make similar mistakes in the future. Furthermore, when bogofilter is
unsure whether a particular message is spam, you need to tell it so
that it'll have a better chance of being able to figure it out next
time.
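As a rough illustration of what "telling bogofilter" means in practice, the sketch below maps a correction onto bogofilter's registration flags (-s registers a message as spam, -n as ham, -S unregisters it from spam, -N from ham). The function name is hypothetical, and the sketch assumes that bogofilter already registered the message when it classified it as spam or ham, and did not register it when it was unsure.

```python
def retrain_flags(classified_as, actually):
    """Return the bogofilter flag that corrects a classification.

    classified_as: what bogofilter said ("spam", "ham" or "unsure").
    actually:      what the message really is ("spam" or "ham").
    Returns None when no retraining is needed.
    """
    classified_as = classified_as.lower()
    actually = actually.lower()
    if actually not in ("spam", "ham"):
        raise ValueError("actually must be 'spam' or 'ham'")
    if classified_as == actually:
        return None  # bogofilter got it right
    if classified_as == "unsure":
        # Assumed not yet registered: just register it on the right side.
        return "-s" if actually == "spam" else "-n"
    if classified_as == "spam":
        # False positive: unregister as spam, register as ham.
        return "-Sn"
    if classified_as == "ham":
        # False negative: unregister as ham, register as spam.
        return "-Ns"
    raise ValueError("classified_as must be 'spam', 'ham' or 'unsure'")
```

The returned flag would then be passed to bogofilter with the message on standard input, e.g. "bogofilter -Sn" for ham that was wrongly classified as spam.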
When bogofilter is integrated into your email client, training
bogofilter about what is and isn't spam is easy — you use
commands in your email client to tell bogofilter whether a particular
message is spam or not, and then the email client tells bogofilter to
retrain the message as necessary based on your instructions. However,
bogofilter-milter.pl has no such built-in training functionality, so
you need to roll your own.
Here's how I do this. Feel free to use my method as-is, adapt it
to your own tastes, or do something completely different. Understand,
however, that if you do nothing, bogofilter will not
work.
- I have four special IMAP folders called "bogotrain", "despam",
"maybespam", and "spamtrain".
- I have a procmail recipe that redirects unsure messages into my
"maybespam" folder so that I can classify them at my leisure and they
don't clutter up my inbox:
:0
* ^X-Bogosity: (unsure|spam)
{ FOLDER=user.$LOGNAME.maybespam }
...
:0 w : $MAILDIR/cyrus$LOCKEXT
|formail -I "From " |/usr/lib/cyrus-imapd/deliver -a $LOGNAME -m $FOLDER -q
Obviously, you'll have to do this slightly differently if you use
something other than procmail
and Cyrus imapd.
- I have bogofilter-milter.pl configured to save copies of both spam
and ham messages. That is, I have symbolic links called "archive" and
"ham_archive", the names configured for $archive_mbox and
$ham_archive_mbox, in my .bogofilter directory, pointing at where I
would like the copies to be archived. I also run logrotate once per
hour to rotate these archives when they get too large and to save
about a month's worth of old ham archives and about a week's worth of old
spam archives (anything more than that would simply take up too much
disk space!).
- When I see a spam message in my inbox or maybespam that made it
through bogofilter, I move it to either bogotrain or spamtrain,
depending on whether I want to automatically submit it to SpamCop in
addition to retraining bogofilter.
- When I see a ham message in my maybespam, I put a copy of it in
despam and move the original into my inbox.
- When I discover that bogofilter has falsely classified a ham
message as spam, I find the incorrectly classified message in my spam
mbox archive, pipe it into "bogofilter -Sn" to reclassify it as ham, and then
remove it from the spam mbox and add it to the ham mbox. I'm OK with
this particular task having to be done manually because it happens
quite infrequently and I don't want a huge spam folder in my IMAP
account.
However, every once in a while I decide that I need to retrain
bogofilter from scratch, i.e., remove my word list and let bogofilter
start building a new one from new incoming email. When I'm doing this,
bogofilter makes a lot of mistakes for a day or two, so I enable
training mode in the Milter (which causes spam messages to be
delivered to my inbox instead of being rejected), create an "isspam"
folder in my IMAP inbox, and tell procmail to put messages there that
bogofilter has classified as spam. Several times a day I go through
this folder, delete the messages that are in fact spam, copy the
non-spam messages into "despam", and then move the originals into my
inbox.
- Now comes the magic... My spamtrain
script is invoked once per hour out of my crontab to do retraining
automatically. It reads my bogospam, spamtrain and despam folders,
and for each message in them, it:
- finds the corresponding message in the ham or spam archive mbox;
- feeds it into bogofilter to retrain as necessary;
- moves it into the other archive mbox;
- (optionally) forwards it to SpamCop; and
- deletes the message from the IMAP folder.
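The first of these steps, finding the archived copy of a message, can be sketched roughly as follows. This is not the spamtrain script itself (which is not reproduced here); it is a minimal Python illustration that relies on the "milter_id" value the Milter adds to the X-Bogosity line. The exact layout of that header is an assumption: the sketch only expects "milter_id=<token>" to appear somewhere in it. The function names are hypothetical.

```python
import mailbox
import re

# Assumption: the ID appears as "milter_id=<token>" somewhere in the
# X-Bogosity header value; the real header layout may differ.
MILTER_ID_RE = re.compile(r"milter_id=([^\s,;]+)")


def extract_milter_id(message):
    """Return the milter_id from a message's X-Bogosity header, or None."""
    header = message.get("X-Bogosity", "")
    match = MILTER_ID_RE.search(str(header))
    return match.group(1) if match else None


def find_in_archive(archive_path, milter_id):
    """Return (key, message) for the archived copy with this ID, or None."""
    archive = mailbox.mbox(archive_path, create=False)
    try:
        for key, archived in archive.iteritems():
            if extract_milter_id(archived) == milter_id:
                return key, archived
    finally:
        archive.close()
    return None
```

Matching on the ID rather than on message content matters because the copy in the IMAP folder may no longer be byte-for-byte identical to the copy the Milter archived.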
Because I have three different classification folders with different
purposes, my script is invoked three times each hour: once with no
arguments, once with "--mailbox bogospam --nospamcop", and once with
"--mailbox despam --nospamcop --despam".
You may wonder why I keep archive mboxes of ham and spam.
Theoretically, I could take each message directly from the IMAP folder
and feed it into bogofilter for retraining. I don't do this for two
reasons:
- The messages that sendmail delivers into my IMAP account are
actually not entirely identical to the messages that are seen by the
Milter. For example, sendmail adds a new "Received" header before
delivery, makes formatting changes to some header fields, sometimes
even makes changes to the body of the message, etc. This is
especially true now that I've recently started using the Milter's
"milter-filter-script" functionality to reformat incoming messages
with spamitarium
before running them through bogofilter. Keeping ham and spam archives
improves bogofilter's accuracy by ensuring that bogofilter is always
retrained with messages that are identical to what it would see when
called by the milter.
- I use my ham and spam archives to retrain
bogofilter when its accuracy starts to suffer.
Retraining
Bogofilter has a number
of configuration
parameters that can be tweaked to alter its behavior. The optimal
values for these parameters vary over time and from person to person,
because spammers are constantly changing the content of their messages
to evade filters and because every person has a slightly different
definition of what constitutes spam and ham.
You can tweak these configuration parameters by hand to try to make
bogofilter work better, but it's easier, and you'll probably get
better results, to
let bogotune
figure out the optimal settings.
I use a script called my-bogotrain
to do this. It uncompresses and concatenates my ham and spam mbox
archives into separate files in /tmp, runs bogofilter over them to
check for misclassified messages, and then runs bogotune on the files
when they're clean. Bogotune spits out the recommended parameters
when it's done, and I copy them into ~/.bogofilter.cf. I usually only
bother to do this when I notice that spam filtering isn't working so
well.
See the comments at the top of my-bogotrain for more information
about how to use it.
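The first stage of that process, uncompressing and concatenating rotated archives into a single file, might look something like the following Python sketch. It is an illustration only, not the my-bogotrain script: the function name is hypothetical, and it assumes that rotated archives share a common filename prefix and that compressed ones end in ".gz".

```python
import glob
import gzip
import shutil


def concatenate_archives(pattern, output_path):
    """Concatenate every file matching pattern into output_path.

    Files ending in ".gz" are uncompressed on the fly; all others are
    copied as-is. Returns the sorted list of files that were read.
    output_path must not itself match pattern.
    """
    sources = sorted(glob.glob(pattern))
    with open(output_path, "wb") as out:
        for source in sources:
            opener = gzip.open if source.endswith(".gz") else open
            with opener(source, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return sources
```

Run once over the ham archives and once over the spam archives, this would yield the two files that are then checked with bogofilter and handed to bogotune.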
Spamitarium
Tom Anderson wrote a neat little script, spamitarium, whose purpose
is to preprocess incoming email before it gets fed into bogofilter to
decrease "noise" in the email and improve bogofilter's effectiveness.
Tom has handed off maintenance of the script to me. The current
version can be
downloaded
from SourceForge, or if you have a recent version of bogofilter
installed on your system, it may have been included with it. Make sure
you're using at least version 0.5.1; anything earlier is too old.
You can find my bogofilter-milter.pl filter script which calls
spamitarium, i.e., the script I put in
~/.bogofilter/milter-filter-script, here.
Seeing it all in action
You can see some neat graphs showing the performance of my
bogofilter installation on my home page.
ChangeLog
2024-02-02
- Now using Sendmail::PMilter instead of Sendmail::Milter.
- Now decoding encoded email subjects before checking against
subject filters. New requirement: Email::MIME Perl module
needs to be available.
- Changed location of PID and socket files from /var/run to
/run.
- Default number of interpreters increased from 7 to 15.
2017-09-12
- Completely rewritten version of "my-bogotrain" which is much
more performant and robust.
- New version of "milter-filter-script.txt" with the ability
to save its input and output when debug is true.
- Updated text about and links to spamitarium.pl, since I'm
now maintaining the version that is included with
bogofilter.
- Updated links on this page and deleted stale links.
- Updates to bogofilter-milter.pl.txt for IPv6 and to run more
concurrent interpreters.
- Updated version of spamtrain script.
2010-04-12
- New version of "spamtrain" script which supports a "--redeliver"
option for causing messages to be redelivered after they are
processed. This is useful, e.g., if you've just despammed a message
and you want it to go through your .procmailrc.
2010-04-08
- Update "Training" section to discuss using an "isspam" folder
during training to make it easier to correct bogofilter when it
misclassifies ham as spam.
2010-04-07, new version (1.77)
- Messages should be archived in $archive_mbox and $ham_archive_mbox
even when in training mode. This gives the user complete control over
the behavior, since s/he can create or delete the archive files along
with creating or deleting $training_file.
2010-04-07, new version (1.76)
- Add support for passing various important information from the
Milter to the filter script through environment variables. See the
$filter_script documentation in the configuration section for more
information.
- Post my version of spamitarium.pl and its wrapper script.
2010-04-07
- Published this page, including releasing my spamtrain and
my-bogotrain scripts for the first time.
- New version of bogofilter-milter.pl (1.74) that supports
feeding messages through an external filter before feeding them to
bogofilter. Search for $filter_script in the configuration section
for more information.
- bogofilter-milter.pl now adds a unique identifier to each message
by adding a variable called "milter_id" to the X-Bogosity line.
This is extremely useful for automated retraining tools such as my
spamtrain script documented above, which need to be able to match up
a message in an IMAP folder with the same message in an mbox archive
created by the Milter, when the MTA may have changed the copy in the
IMAP folder such that it is no longer identical to the copy in the
mbox archive.
- bogofilter-milter.pl now supports per-user Subject filters, i.e.,
user-specific regular expressions which are matched against incoming
messages to detect messages that should not be filtered. Search for
$subject_filter_file in the configuration section for more
information.