Spam Filter Instructions.

Background.

The filter relies on open source software called Bogofilter, which applies Bayes' theorem to classify email. The filter requires a training period, as well as a small amount of on-going training. Messages considered spam will be moved to the possible_spam folder. You'll get a nightly report listing the subject and sender of every message moved there, so you can find any of these messages in possible_spam by its subject line. It is possible that some good email will be incorrectly moved to the possible_spam folder (false positives). It is very important that you move such messages into the ham_raw folder to help train the filter.
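As a rough illustration of the Bayesian idea, the sketch below estimates how "spammy" each word is from training counts and then combines the word scores into one score for the whole message. It is a simplified naive Bayes calculation, not Bogofilter's actual algorithm (which is more refined). The function names, words, and counts are all made up for the example.

```python
import math

def word_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | word) from how often the word appeared in training mail.

    Add-one smoothing keeps the estimate away from exactly 0 or 1.
    """
    spam_rate = (spam_hits + 1) / (n_spam + 2)
    ham_rate = (ham_hits + 1) / (n_ham + 2)
    return spam_rate / (spam_rate + ham_rate)

def message_score(word_probs):
    """Combine per-word probabilities into one spam score (naive Bayes, equal priors)."""
    log_spam = sum(math.log(p) for p in word_probs)
    log_ham = sum(math.log(1.0 - p) for p in word_probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Hypothetical counts: (times seen in spam, times seen in good email),
# out of 1000 spam and 1000 good training messages.
counts = {"lottery": (300, 1), "invoice": (20, 150), "meeting": (2, 200)}
probs = {w: word_spam_prob(s, h, 1000, 1000) for w, (s, h) in counts.items()}

spammy = message_score([probs["lottery"], probs["invoice"]])  # close to 1
hammy = message_score([probs["meeting"], probs["invoice"]])   # close to 0
```

A message dominated by words seen mostly in spam scores near 1, and one dominated by words from your good email scores near 0. This is why training on your own mail matters.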

The filter looks for spam in your Inbox once per hour at 10 minutes after the hour. Spam arrives constantly, so it is normal to see a few recent spams in your Inbox. Any spam older than two hours has been missed by the system, and you should manually move such messages to the spam_raw folder.
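The timing rule above can be sketched in a few lines. This is only an illustration of the schedule as described here (runs at 10 minutes past each hour, and a two-hour rule of thumb). The function names and the example dates are hypothetical.

```python
from datetime import datetime, timedelta

def next_filter_run(arrived):
    """Return the first filter run at or after `arrived` (runs are at :10 past each hour)."""
    run = arrived.replace(minute=10, second=0, microsecond=0)
    if run < arrived:
        run += timedelta(hours=1)
    return run

def missed_by_filter(arrived, now):
    """True if the message has sat in the Inbox for more than two hours."""
    return now - arrived > timedelta(hours=2)

arrived = datetime(2024, 1, 15, 9, 25)
print(next_filter_run(arrived))                                  # 2024-01-15 10:10:00
print(missed_by_filter(arrived, datetime(2024, 1, 15, 10, 30)))  # False
print(missed_by_filter(arrived, datetime(2024, 1, 15, 12, 0)))   # True
```

A spam that arrived at 9:25 is first examined at 10:10, so seeing it in your Inbox at 10:05 is normal. If it is still there at noon, move it to spam_raw.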

There is a spam threshold that is (ideally) set high enough that you get no false positives. A false positive is a good email incorrectly identified as spam. Because of this safety margin in the filter settings, a small percentage of spam is expected to pass the filter. Every now and then, you should delete old messages from the possible_spam folder. I recommend that once a month you delete possible_spam messages older than one week. Keep the most recent week of spam in case good email was accidentally filtered.
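The trade-off behind the threshold can be shown with a small sketch. The scores, the two threshold values, and the function names below are invented for illustration; they are not the settings used on this server.

```python
def classify(score, threshold):
    """File a message as spam only if its score reaches the threshold."""
    return "possible_spam" if score >= threshold else "Inbox"

# Hypothetical (score, is_really_spam) pairs for six messages.
messages = [(0.999, True), (0.97, True), (0.62, True),
            (0.80, False), (0.10, False), (0.01, False)]

def count_errors(threshold):
    """Return (false_positives, missed_spam) at the given threshold."""
    false_pos = sum(1 for s, spam in messages
                    if not spam and classify(s, threshold) == "possible_spam")
    missed = sum(1 for s, spam in messages
                 if spam and classify(s, threshold) == "Inbox")
    return false_pos, missed

print(count_errors(0.5))   # (1, 0): one good email wrongly filtered
print(count_errors(0.95))  # (0, 1): no false positives, one spam gets through
```

Raising the threshold trades a lost good email for a spam left in the Inbox, which is the safer mistake.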

Getting started.

The spam system uses 3 folders which will be created by the sysadmin when the spam filter is put into training mode.

  • ham_raw
  • spam_raw
  • possible_spam

You may need to "subscribe" to these folders. You can subscribe to folders from your IMAP email client (for example: Thunderbird) or you can use webmail. For webmail, please log in to webmail.laudeman.com. Click the "Folders" link at the top, and use the menus, checkboxes, etc. to subscribe to the 3 spam folders.

Normally, the filter is in "training mode" for the first week or two. It will perform all normal operations, but will not move any messages to the possible_spam folder.

You'll need to remind the system administrator (Tom) to enable the filter for real after a week of training.

How to train the filter.

Move good email into the ham_raw folder, and spam into the spam_raw folder. Once per hour the system will check these two folders, and update your personal Bayesian wordlist databases. Once processed, messages from ham_raw are moved into Inbox, and messages from spam_raw are moved into possible_spam.
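The hourly training pass can be pictured roughly as follows. This is an in-memory sketch only: real folders live on the mail server and the real wordlists are Bogofilter database files. The function names, the crude tokenizer, and the sample messages are assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lowercase words split on whitespace."""
    return text.lower().split()

def process_training_folders(folders, wordlists):
    """One hourly pass: learn from ham_raw and spam_raw, then refile the messages."""
    for raw, kind, destination in (("ham_raw", "ham", "Inbox"),
                                   ("spam_raw", "spam", "possible_spam")):
        for message in folders[raw]:
            wordlists[kind].update(tokenize(message))  # count each word once per occurrence
            folders[destination].append(message)       # refile after learning
        folders[raw] = []                               # training folder is now empty

folders = {"Inbox": [], "possible_spam": [],
           "ham_raw": ["Lunch meeting moved to Friday"],
           "spam_raw": ["Win a free prize now"]}
wordlists = {"ham": Counter(), "spam": Counter()}
process_training_folders(folders, wordlists)
```

After the pass, both training folders are empty, the good message is back in Inbox, and each wordlist has counted the words from its message.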

Please be careful. If you incorrectly categorize email, the filter will not perform well. After a week or so of initial training, it should only be necessary to move occasional good email into the ham_raw folder. You will probably move several spams (fewer than 10) per day into the spam_raw folder. A busy mailbox can expect 3 to 5 spams per day to be missed. The filter typically removes nearly all spam.

The filter is designed to let some spam through, and correctly keep all good email. The filter works best if good email is never put into possible_spam.

Once per day (at 5:00 am) you will get an email from the system called "Spam filter results". Carefully check this report to make sure no good email was incorrectly filtered. Any good email that the filter incorrectly put into possible_spam should be manually moved (by you) from the possible_spam folder into the ham_raw folder. After processing (approximately 2 hours), you will see the good email back in your Inbox. To reiterate: when the system incorrectly moves good email to the possible_spam folder, you correct the filter by training it. Train the system about good email by putting the good message(s) into the ham_raw folder.

On-going training.

Put any spam the system missed (spam that has been in your Inbox for more than 2 hours) into your spam_raw folder. If you get good email from a new correspondent, or you get email about a new or unusual topic, it may be a good idea to put the good email into ham_raw. There is no harm in putting any good email into ham_raw, even if the system knows it is good email.

After a few weeks you will only rarely need to move a message into ham_raw.

Spammers are constantly sending new types of spam. This being the case, you will probably need some on-going training to keep the filter up to date.

What about mistakes in training?

If you put a message into the wrong folder and it gets processed, there may be no discernible change in the filter. The wordlist databases are quite large and extensive, so one mail message rarely has much effect. Bayesian systems are statistical in nature, and are quite robust. Bayes' theorem cannot be fooled by random lists of keywords or strange text, since such text is (probably) not statistically similar to your normal good email.
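A little arithmetic shows why a single mistake rarely matters. The counts below are hypothetical, and the formula is a simplified ratio rather than Bogofilter's exact calculation.

```python
def spam_prob(spam_hits, ham_hits):
    """Fraction of a word's training occurrences that came from spam."""
    return spam_hits / (spam_hits + ham_hits)

# Hypothetical wordlist entry: "invoice" seen 40 times in spam, 960 times in good email.
before = spam_prob(40, 960)
# One good email containing "invoice" is wrongly trained as spam.
after = spam_prob(41, 960)

print(f"{before:.4f} -> {after:.4f}")  # 0.0400 -> 0.0410
```

With about a thousand prior observations, the mistake shifts the word's score by roughly a tenth of a percentage point.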

There are backups of all your files from the previous day. If necessary, your spam and ham wordlists can be retrieved.

What else can I do?

Don't publish your email address on your web site. Instead use a contact form, or get potential customers to phone you. Your system administrator can provide a web contact form.

Some web sites collect email information by essentially hacking your browser. Most of these sites are disreputable. To prevent this, use an alternate browser profile for this type of browsing and do not use a real email address in that browser profile.

Never answer spam and don't read it. Some HTML emails can detect when you read their spam (via special image tags). Thunderbird has features to prevent this tracking.

Some spam has "instructions to unsubscribe". Do not follow those instructions. Any contact from you confirms to the spammers that you have read the spam, and they will make it a priority to send you more spam.

Do not buy products from companies that send spam.

What is the .bogofilter directory?

Some of you are able to log in to the server command line (everyone else can ignore the following information). There are no user-configurable files in the .bogofilter directory. The files here are the wordlists, system date files, system output files, and the daily email digest of spam filter results.

What is the crontab.txt file?

Don't change this file.

 

For more information use this handy form to contact Tom.