SpamAssassin collective education
As a Bayesian spam filter, SpamAssassin becomes much more effective when it is trained: we have to tell it which kinds of mail our users consider spam and which they do not, and to report its mistakes to it.
The setup is described below.
Overview
Seeding the database
To begin with, we have to seed the database with thousands of messages of both spam and ham, taken from:
- spam: http://spamarchive.org
- ham: mail in our users' inboxes that has the 'replied' (R) Maildir flag set. The assumption is that mail which someone replied to was probably not spam.
It works best if the spam and ham databases are roughly of equal size.
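Selecting the ham seed can be sketched as follows. This is a minimal Python sketch, not the script actually used here: it assumes Maildir filenames ending in `:2,<flags>`, where the `R` flag marks a replied message. The function names and any path passed to them are hypothetical.

```python
import os

def replied_messages(maildir):
    """Return paths of messages in maildir/cur whose Maildir flags include 'R'."""
    cur = os.path.join(maildir, "cur")
    selected = []
    for name in sorted(os.listdir(cur)):
        # Maildir filenames look like "<unique>:2,<flags>"; each flag is one letter.
        _, sep, flags = name.rpartition(":2,")
        if sep and "R" in flags:
            selected.append(os.path.join(cur, name))
    return selected

def sa_learn_ham_command(paths):
    """Build the sa-learn invocation that would learn the given files as ham."""
    return ["sa-learn", "--ham", *paths]
```

The list returned by `sa_learn_ham_command(replied_messages(...))` could then be handed to `subprocess.run`, once per user Maildir.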
Daily training
A typical practice to keep the database size down and make the system more efficient is to train only on mistakes:
- false-negatives: spam that was not detected;
- false-positives: legitimate email erroneously detected as spam.
However, we will hopefully not get enough false positives for this to be effective. That is why we train SpamAssassin both on its mistakes and on the email that has been replied to.
Of course, mail users have to be told how they can contribute; an example in French: AutoAideMail#Faire_sa_peau_au_spam.
False-negatives life-cycle
IMAP users
They simply have to move the false-negatives to their
Spam IMAP folder.
A daily cronjob runs the sa-education-false-negatives script, which feeds 'sa-learn --spam' with the mail located in the Spam folders that is more than 3 days old but less than 4 days old.
It doesn't matter if these messages are false-negatives or spam already detected, since SpamAssassin keeps track of the messages it has already learnt and ignores them accordingly.
Note: old mail is automatically deleted from this folder after 30 days by courier-imap, thanks to the setting IMAP_EMPTYTRASH=Trash:30,Spam:30 in /etc/courier/imapd.
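The age window used by the daily job can be sketched as follows. This is a minimal Python sketch under stated assumptions: the sa-education-false-negatives script itself is not shown on this page, so measuring age by file modification time is an assumption, and the function name is hypothetical.

```python
import os
import time

DAY = 24 * 60 * 60  # seconds in a day

def messages_in_age_window(folder, min_days, max_days, now=None):
    """Return files in folder that are older than min_days and newer than max_days."""
    now = time.time() if now is None else now
    selected = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        age = now - os.path.getmtime(path)  # assumption: age is taken from mtime
        if min_days * DAY < age < max_days * DAY:
            selected.append(path)
    return selected
```

Calling `messages_in_age_window(spam_folder, 3, 4)` once a day should pick up each message on roughly one run; a message picked up twice does no harm, since sa-learn ignores what it has already learnt.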
Webmail users
SquirrelMail's Spam Buttons plugin places a "Spam" button on the message list page as well as on the message view page. This button moves the selected messages to the Spam IMAP folder.
POP users
They simply have to bounce (not forward) the false-negatives to a spamtrap.
Spamtrap
A spamtrap is a mailbox that is periodically used to feed
sa-learn --spam.
A typical practice is to create a Unix user with a name of your choice, and to have all of its mail piped directly to sa-learn with procmail, as explained on this page and on this one.
But we dislike creating Unix users, so instead:
- we create a virtual mailbox 'spam@boum.org'
- we tell AMaVis to apply no virus or spam filter on the mail sent to this address :
$spam_lovers{lc('spam@boum.org')} = 1; and $virus_lovers{lc('spam@boum.org')} = 1; in /etc/amavis/amavisd.conf (or /etc/amavis/conf.d/50-user, for Debian etch)
- a daily cronjob runs the sa-education-spamtrap script, which:
  - feeds sa-learn --spam with the mail received by the spamtrap, ignoring the Resent-* headers by using the sa-education-spamtrap_user_prefs custom SpamAssassin config file;
  - deletes this mail.
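The feed-then-delete step of the spamtrap job can be sketched as follows. This is a minimal Python sketch, not the actual sa-education-spamtrap script: the function name is hypothetical, the preferences file is passed with sa-learn's `--prefspath` option, and deleting only after a successful run is a design choice of this sketch rather than something stated above.

```python
import os
import subprocess

def learn_and_purge(paths, prefs_file, run=subprocess.run):
    """Feed the given messages to sa-learn as spam, then delete them on success.

    Returns the number of messages deleted.
    """
    if not paths:
        return 0
    cmd = ["sa-learn", "--spam", "--prefspath=" + prefs_file, *paths]
    result = run(cmd)
    if result.returncode != 0:
        # Keep the mail so that the next run can retry.
        return 0
    for path in paths:
        os.remove(path)
    return len(paths)
```

The `run` parameter exists only so that the command can be replaced by a stub when trying the function without SpamAssassin installed.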
It is important to allow only our authenticated users to send mail to this address. This can be achieved with the Postfix smtpd_recipient_restrictions parameter, and more specifically its check_recipient_access restriction, which checks an access table.
E.g. in
/etc/postfix/main.cf :
smtpd_recipient_restrictions =
    permit_sasl_authenticated,
    permit_mynetworks,
    check_recipient_access hash:/etc/postfix/recipient_access,
    reject_non_fqdn_hostname,
    reject_unauth_destination,
    < ... additional RBL checks and policy daemons calls ... >
And in
/etc/postfix/recipient_access :
spam@boum.org REJECT
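Why the order of the restrictions matters can be illustrated with a toy model. This is a simplified Python sketch of first-match evaluation, not Postfix code: only the first three restrictions are modelled, the client is a plain dictionary with hypothetical keys, and anything that reaches the end of the list is simply permitted.

```python
def permit_sasl_authenticated(client):
    return "PERMIT" if client.get("sasl_authenticated") else "DUNNO"

def permit_mynetworks(client):
    return "PERMIT" if client.get("in_mynetworks") else "DUNNO"

def check_recipient_access(client):
    # Mirrors the single entry of /etc/postfix/recipient_access.
    table = {"spam@boum.org": "REJECT"}
    return table.get(client["recipient"].lower(), "DUNNO")

def evaluate(restrictions, client):
    """Return the first decisive verdict; later restrictions are never consulted."""
    for restriction in restrictions:
        verdict = restriction(client)
        if verdict in ("PERMIT", "REJECT"):
            return verdict
    return "PERMIT"  # toy model: nothing objected

ORDER = [permit_sasl_authenticated, permit_mynetworks, check_recipient_access]
```

In this model an authenticated client is permitted before the access table is ever looked up, while an unauthenticated client writing to spam@boum.org reaches the table and is rejected.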
Ham life-cycle
A weekly cronjob runs the sa-education-ham script, which feeds sa-learn --ham with the messages that have been replied to in the last 7 days.
False-positives life-cycle
NB: it is important to have the SpamAssassin setting report_safe set to 0; otherwise the false positives are stored as attachments to the spam report email, and it is then hard to feed them to sa-learn --ham.
IMAP users
They simply have to move the false-positives to their NonSpam IMAP folder.
A daily cronjob runs the sa-education-false-positives script, which feeds 'sa-learn --ham' with the mail less than 1 day old located in the NonSpam folders.
Note: old mail is automatically cleaned from this folder by courier-imap, thanks to the setting IMAP_EMPTYTRASH in /etc/courier/imapd.
Webmail users
SquirrelMail's Spam Buttons plugin places a "Not Spam" button on the message list page as well as on the single message view page. This button moves the selected messages to the NonSpam IMAP folder.
POP users
They simply have to bounce (not forward) the false-positives to a nonspamtrap.
NonSpamtrap
A nonspamtrap is a mailbox that is periodically used to feed
sa-learn --ham.
It works exactly like the spamtrap described above, with a sa-education-nonspamtrap script doing the feeding.
Bonus hacks
- spam_and_nonspam_folders-1.5.1.diff: tells SquirrelMail that the Spam and NonSpam folders are special ones (it will then display them in a special way and, at least, prevent them from being deleted)
Sources
Information and sentences stolen from: