osbf-lua training for the rest of us
May 31st, 2007 | Administrivia | Tags: Administrivia, Hacking, Spam Filtering
osbf-lua is undoubtedly the best Bayesian spam filtering solution available today. It's lightning fast (thanks to being a C extension to the tiny Lua runtime), requires almost no training and is extremely effective. There are several possible ways to train osbf-lua:
- Use the built-in mail gateway for training, i.e. send replies to yourself and run spamfilter.lua via your MDA (I use maildrop). This includes the newer mass classification method with HTML emails.
- Use trainspamfilter.pl by Christian Siefkes or Holger Weiss’ more elaborate train_osbf.lua for mass training.
However, both of the above scripts are quite heavy for such a simple job and they only work on mbox mailboxes. Since I don’t write Perl or Lua and I love maildir-style storage, I came up with a simple training shell script that is ideal for my purposes:
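#!/bin/sh
# Train osbf-lua on read messages that were not classified with full confidence.
LEARN='/usr/share/osbf-lua/spamfilter.lua -udir .osbf-lua'
cd "$HOME" || exit 1
# Inbox: messages tagged [+], [-] or [S] are learned as nonspam.
for i in $(grep -l 'X-OSBF-Lua-Score:.*\[[+S-]\]' Mail/cur/*); do
    $LEARN --learn=nonspam < "$i" >/dev/null
done
# Spam folder: messages tagged [+], [-] or [H] are learned as spam.
for i in $(grep -l 'X-OSBF-Lua-Score:.*\[[+H-]\]' Mail/.Spam/cur/*); do
    $LEARN --learn=spam < "$i" >/dev/null
done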
The script expects a standard maildir in ~/Mail and a working osbf-lua setup. It finds all mails that are (a) not new (since you certainly don’t want to train on unread messages) and (b) not perfectly classified and sends them to spamfilter.lua for learning. This includes mis-classified messages and those that are in the so-called reinforcement zone, i.e. those that osbf-lua is not sure about.
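To make the selection step concrete, here is a small sketch that runs the inbox pattern against a few header lines. The sample headers and score values are made up for illustration, and needs_ham_training is a hypothetical helper name; only the bracketed tag matters to the pattern.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a header line carries one of the
# tags the inbox loop trains on ([+], [-] or [S]).
needs_ham_training() {
    printf '%s\n' "$1" | grep -q 'X-OSBF-Lua-Score:.*\[[+S-]\]'
}

# Made-up sample headers; the numbers are placeholders.
for hdr in \
    'X-OSBF-Lua-Score: 35.10 [H]' \
    'X-OSBF-Lua-Score: 4.20 [+]' \
    'X-OSBF-Lua-Score: -3.70 [-]' \
    'X-OSBF-Lua-Score: -41.00 [S]'
do
    if needs_ham_training "$hdr"; then
        echo "train: $hdr"
    else
        echo "skip:  $hdr"
    fi
done
```

Only the [H] line is skipped; the other three would be piped to spamfilter.lua as nonspam. The spam folder loop works the same way with H and S swapped.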
After training, spam messages are deleted while ham messages are kept. The spamfilter recognizes already learned messages, so there is no risk in sending them multiple times if you don't move them out of your inbox. In fact, you could leave out the grep command completely and just pipe in all messages in Mail/cur/ and Mail/.Spam/cur/, but I don't want to waste resources.
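For comparison, the grep-less variant could look roughly like this. It is only a sketch: train_all is a hypothetical helper name, and LEARN is the same command string as in the script, made overridable so the loop can be tried without touching the real filter.

```shell
#!/bin/sh
# Same command as in the training script; set LEARN beforehand to override it.
LEARN=${LEARN:-'/usr/share/osbf-lua/spamfilter.lua -udir .osbf-lua'}

# train_all DIR CLASS: feed every regular file in DIR to the filter.
# (Hypothetical helper, not part of osbf-lua.)
train_all() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue    # skip an empty or missing directory
        $LEARN --learn="$2" < "$f" >/dev/null
    done
}
```

After a cd to $HOME, calling train_all Mail/cur nonspam and train_all Mail/.Spam/cur spam would stand in for the two grep loops, at the cost of starting the filter once for every read message on every run.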
Running the above script via a cron job has the advantage that I do not have to change my email workflow: Unrecognized spam is moved to the spam folder, ham is in the inbox, just as before.