Note: this entry is part of my class project on experiment and systems evaluation. Only the introduction, limitations, and conclusion sections are included here. For the full paper please take a look at the pdf version.
The purpose of this study is to evaluate the effectiveness of two spam filtering software packages in order to decide which one to recommend for further use. Both Bogofilter (BF) and SpamBayes (SB) are based on the Bayesian probabilistic model (Baeza-Yates & Ribeiro-Neto, 1999, p. 48), as adapted and proposed by Graham (2002; 2003) for spam e-mail identification and filtering. The key to Graham’s Bayesian filtering technique is the ability to train the software with known spam and non-spam (i.e. good) e-mail messages on a per-user basis. The idea is that with increased and continued training both packages will become more effective at identifying spam messages, while at the same time producing fewer false positives and false negatives.
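To make the training idea concrete, here is a minimal sketch of the combining rule described by Graham (2002). It is not the code of either BF or SB, and it simplifies Graham's token counting to the fraction of messages containing a token. The function names, the 0.4 score for unseen tokens, and the limit of 15 tokens are assumptions taken from Graham's description rather than from this study.

```python
from collections import Counter
from math import prod


def token_probs(spam_msgs, good_msgs):
    """Estimate a per-token spam probability from tokenized training mail.

    Each message is a list of tokens. Simplification (assumption): we use the
    fraction of messages containing the token instead of raw occurrence counts.
    """
    spam_counts = Counter(t for m in spam_msgs for t in set(m))
    good_counts = Counter(t for m in good_msgs for t in set(m))
    n_spam = max(len(spam_msgs), 1)
    n_good = max(len(good_msgs), 1)
    probs = {}
    for tok in set(spam_counts) | set(good_counts):
        s = spam_counts[tok] / n_spam
        g = good_counts[tok] / n_good
        # Clamp so that a single token can never be absolutely decisive.
        probs[tok] = min(0.99, max(0.01, s / (s + g)))
    return probs


def spam_probability(tokens, probs, unknown=0.4, top_n=15):
    """Combine the most 'interesting' token probabilities into one score."""
    ps = [probs.get(t, unknown) for t in set(tokens)]
    # Tokens furthest from the neutral 0.5 carry the most evidence.
    ps = sorted(ps, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam_like = prod(ps)
    good_like = prod(1 - p for p in ps)
    return spam_like / (spam_like + good_like)
```

With more training messages the per-token estimates become less noisy, which is the mechanism behind the expectation that effectiveness improves with training.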
Both BF and SB are open-source software packages available for free download and use. BF is available for Linux/Unix only and it can be integrated with other mail delivery and filtering tools to automatically tag and filter spam e-mail messages to a separate folder. SB is available for multiple operating system platforms and can be configured to work with Unix/Linux command line mail delivery and filtering tools, as well as POP3, IMAP, the Outlook e-mail client, etc. Detailed instructions are provided at http://bogofilter.sourceforge.net/ and http://spambayes.sourceforge.net/ respectively.
I have used both of these systems and would like to be able to answer the question of which one is more effective at identifying spam.
The Bayesian technique suggests that with continued training the software packages should become more effective in spam identification. Thus, the first research question:
RQ1: Does the spam filtering effectiveness of BF and SB improve as the amount of e-mail messages used for training increases?
From my personal experience, SB appears to be more effective than BF, although after a good amount of training both SB and BF seem to be very effective, in that I rarely get false positives with either implementation. Thus, the second research question:
RQ2: Is the spam filtering effectiveness of SB better than that of BF?
Assumptions and limitations
For a more complete analysis of effectiveness, the experiment needs to be repeated with multiple corpora provided by different individuals, because e-mail use is unique to each individual and follows different patterns. The pattern of e-mail messages embedded in the set3000 corpus is defined by my personal e-mail communication as well as by e-mails I receive at aliases and forwarding e-mail addresses due to my involvement as moderator and administrator of various electronic news and discussion lists.
Additionally, the cap of 3000 messages in the corpus can be modified to see any potential variability in effectiveness. That said, I believe that 3000 messages tested for spam probability is a sufficiently large amount, comparable to real-life operational spam filtering systems.
The equal proportion of spam and good messages in the training sets might not resemble real-life situations. The rate at which spam messages are received is much higher than that of good messages, at least in my case. Accounting for a variable proportion would improve this experiment. This might even yield an optimal proportion for a given spam cutoff level.
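One way such a follow-up could vary the proportion is to draw each training set with an explicit spam fraction. The sketch below is only an illustration; the function name, its parameters, and the fixed random seed are hypothetical and were not part of this experiment.

```python
import random


def sample_training_set(spam, good, size, spam_fraction, seed=0):
    """Draw a training set of `size` messages with the given share of spam.

    `spam` and `good` are lists of messages; the seed keeps runs repeatable.
    """
    rng = random.Random(seed)
    n_spam = round(size * spam_fraction)
    n_good = size - n_spam
    if n_spam > len(spam) or n_good > len(good):
        raise ValueError("not enough messages for the requested mix")
    # Sample without replacement so no message is trained on twice.
    return rng.sample(spam, n_spam), rng.sample(good, n_good)
```

Repeating the evaluation over several values of `spam_fraction` at a fixed spam cutoff would show whether the 50/50 split used here is a favorable or unfavorable case.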
In this experiment, the issue of performance was not considered. Apart from effectiveness, if one of the systems is to be used for real-time spam filtering on a Unix/Linux server supporting thousands of users, SB might be at a disadvantage because it is implemented in Python. BF, on the other hand, is implemented in C and runs significantly faster.
Based on the above analyses, the following can be concluded:
• the spam filtering effectiveness of both SB and BF improves with the increased number of training messages
• at each training level SB is more effective than BF
Recommendation: Taken together with the results in Figure 4 and Table 4, which show the numbers of FP, FN, TP, and TN at the 0.9 spam cutoff for different training levels, SB is more effective because it produces significantly fewer FN than BF. To minimize the training effort caused by false negatives and false positives, it is recommended that once SB is installed for use, it be trained with at least 200 or 400 messages (half good and half spam). At this stage, the number of FP is zero. However, spamming techniques change as fast as, or even faster than, spam filtering packages. This is an ongoing battle, and no software package can identify all spam messages.
For example, my installation of SB rarely produces false positives. But once in a while it does, especially when the pattern of spam messages changes all of a sudden or a virus with a unique behavior appears on the scene. I also expect to receive false positives when subscribing to a new discussion list, more so if it is in a different language or on a topic different from those of the discussion lists I’m already subscribed to.
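For readers who want to reproduce the kind of tally referred to in the recommendation (FP, FN, TP, and TN at a 0.9 spam cutoff), the following sketch counts the four outcomes from a list of spam probabilities and known labels. The function name and the choice of treating a score equal to the cutoff as spam are assumptions, and no data from the study is reproduced here.

```python
def confusion_counts(scores, labels, spam_cutoff=0.9):
    """Tally TP, FP, TN, FN for spam probabilities against known labels.

    `scores` are spam probabilities in [0, 1]; `labels` are True for spam
    and False for good mail. A message is flagged when score >= spam_cutoff.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, is_spam in zip(scores, labels):
        flagged = score >= spam_cutoff
        if flagged and is_spam:
            counts["TP"] += 1      # spam correctly flagged
        elif flagged and not is_spam:
            counts["FP"] += 1      # good mail wrongly flagged as spam
        elif is_spam:
            counts["FN"] += 1      # spam that slipped through
        else:
            counts["TN"] += 1      # good mail correctly passed
    return counts
```

Running this once per training level gives one row of a table like the one the recommendation relies on.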
The battle against spam will continue as long as spammers have incentives to send spam messages. Spam filtering systems are indeed helpful in reducing false positives and false negatives. Both SB and BF seem to be designed to eliminate false positives with as little training as possible. In any case, due diligence and patience are needed on the part of the user. For better effectiveness, the user should continuously train the system of choice. To aid in this process, both SB and BF allow for a good cutoff level in addition to the spam cutoff level. Messages with spam probabilities between the good cutoff and the spam cutoff can be filtered into a separate folder (usually called ‘unsure’) and trained appropriately.
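The two-cutoff scheme amounts to a three-way decision, which can be sketched as follows. The function name is hypothetical, the 0.2 good cutoff is an illustrative value only, and the treatment of scores exactly at a cutoff is an assumption; the actual option names and boundary handling in each package should be checked against its documentation.

```python
def classify(score, good_cutoff=0.2, spam_cutoff=0.9):
    """Sort a spam probability into 'good', 'unsure', or 'spam'."""
    if not 0.0 <= good_cutoff <= spam_cutoff <= 1.0:
        raise ValueError("cutoffs must satisfy 0 <= good <= spam <= 1")
    if score >= spam_cutoff:
        return "spam"
    if score <= good_cutoff:
        return "good"
    # Everything in between goes to the folder the user reviews and trains on.
    return "unsure"
```

Widening the gap between the two cutoffs sends more mail to the ‘unsure’ folder, which trades extra manual review for fewer outright misclassifications.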