May 2004 Archives

At last, there is a realization that information and communication technologies do not help the 'disadvantaged and vulnerable groups' by way of some magic. Because the tools of economic development in most cases reflect the social structures within which they function, and thus 'favor' the people in 'power', a concentrated effort is needed to ensure that people less likely to 'magically' benefit from such advances do indeed reap the benefits.

The 'Technologies of a Digital World' conference/Expo seems to be an effort in the right direction. At least they are emphasizing that something other than 'magic' needs to be done.

"Technology is an enabler as well as a catalyst to ensure companies operate profitably and governments operate more efficiently in the global environment. But technology should also be the medium for people from all walks of life to harness the new opportunities offered by ICT, and act as fundamental elements for creating new skills and shaping mindsets to churn the engine of the knowledge-economy."
...
The Expo and Seminar, first of its kind to be held in Brunei, carries the theme, 'Technologies of a Digital World' and is centred on the development of technologies suited to the disadvantaged and vulnerable groups and the development of affordable technologies to facilitate people's access to ICT.

From Adam Smith to Open Source

From the article 'From Adam Smith to Open Source':

"The Internet is a manifestation of the validity of Adam Smith's theories, as is the growth of Linux, itself, Young argued. The way in which the Internet works and was created is as a distributed system to which multiple self-interests contributed. This resulted in something that was better than any one individual company or government could have ever created."

"Operating-system adoption is driven by the availability of applications, according to Young, which is something that, in early days of its existence, Linux did not have. That said, he added, it was the Internet, itself, with applications like the Apache Web Server, DNS and Sendmail -- all free and open source endeavors -- that serve as further proof of Adam Smith's theory is applied to the growth of the free and open source software movement. "

"The Internet was the killer app that drove the adoption of Linux," said Young.

No comments... the argument is self-explanatory.

The qualitative study (Scacchi, 2002) I have selected to critique is published in an electrical engineering oriented, peer-reviewed scholarly journal. The author is aware of his quantitatively oriented audience and thus from the very beginning sets the expectation that the study is “… not about hypothesis testing or testing the viability of a perspective software engineering methodology or notational form” (p. 24). Much as Lincoln and Guba (1985) define naturalistic inquiry in terms of what it is not, Scacchi deems it necessary to define qualitative research in terms of its not being quantitative research. The tensions that emerge from presenting a non-quantitative study to an audience expecting quantitative work are pervasive throughout the article. Because of these tensions, and in an attempt not to alienate his audience, either the author has decided to take many shortcuts, which show in the lack of proper definition and use of qualitative methods, or the author himself is still in the process of becoming familiar with various qualitative methods. In the rest of this paper I will concentrate on these struggles and attempts, and on what could have been done better, without forgetting that what the author has done may be a purposefully chosen middle ground, because the audience was not prepared for a full switch from quantitative to qualitative methodology and methods.

The core of this article is to understand the nature of, and the processes around, requirements for the development of open source software (Scacchi, p. 24). Since open source development is a new approach to software development, the author rightfully proposes qualitative methods for studying it: “… investigation of the socio-technical processes, work practices and community forms found in the open source software development. The purpose of this investigation, over several years, is to develop narrative, semi-structured (i.e. hypertextual) and formal computational models of these processes, practices and community forms” (p. 24). The preceding quote also suggests a mixed-methods approach, where the findings of the qualitative part of the study (i.e. the ‘investigation’) would inform the quantitative part in building computational models. However, this article is restricted to the investigative part of the effort.

the battle against spam goes on: spambayes vs. bogofilter

Note: this entry is part of my class project on experiment and systems evaluation. Only the introduction, limitations, and conclusion sections are included here. For the full paper please take a look at the pdf version.

Introduction

The purpose of this study is to evaluate the effectiveness of two spam filtering software packages in order to decide which one to recommend for further use. Both Bogofilter (BF) and SpamBayes (SB) are based on the Bayesian probabilistic model (Baeza-Yates & Ribeiro-Neto, 1999, p. 48), as adapted and proposed by Graham (2002; 2003) for spam e-mail identification and filtering. The key to Graham’s Bayesian filtering technique is the ability to train the software with known spam and non-spam (i.e. good) e-mail messages on a per-user basis. The idea is that with increased and continued training both packages will become more effective at identifying spam messages, while at the same time producing fewer false positives and false negatives.
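
To make the training idea concrete, here is a minimal sketch of a Graham-style Bayesian score. It is a simplification for illustration only and is not the actual code of BF or SB: the whitespace tokenizer, the 0.4 score for unseen tokens, the 0.01/0.99 clamps, and all function names are assumptions.

```python
from collections import Counter
from math import prod


def train(messages):
    """Count, for each token, the number of messages it appears in."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))  # naive tokenizer (assumption)
    return counts


def token_spam_prob(token, spam_counts, good_counts, n_spam, n_good):
    """Estimate P(spam | token) from relative frequencies in the training mail."""
    spam_freq = spam_counts[token] / n_spam
    good_freq = good_counts[token] / n_good
    if spam_freq + good_freq == 0:
        return 0.4  # never-seen token: assumed slightly "good" score
    # Clamp so that a single token can never decide the outcome alone.
    return min(0.99, max(0.01, spam_freq / (spam_freq + good_freq)))


def message_spam_prob(text, spam_counts, good_counts, n_spam, n_good, top_n=15):
    """Combine the most telling token probabilities into one spam score."""
    probs = [
        token_spam_prob(t, spam_counts, good_counts, n_spam, n_good)
        for t in set(text.lower().split())
    ]
    # Keep the tokens whose probability lies farthest from the neutral 0.5.
    probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam_like = prod(probs)
    good_like = prod(1 - p for p in probs)
    return spam_like / (spam_like + good_like)


# Tiny made-up training sets, one per class, as a single user would supply them.
spam_mail = ["buy cheap pills now", "cheap pills online"]
good_mail = ["meeting notes attached", "lunch meeting tomorrow"]
spam_counts, good_counts = train(spam_mail), train(good_mail)
score = message_spam_prob("cheap pills", spam_counts, good_counts,
                          len(spam_mail), len(good_mail))
```

Every additional trained message refines the per-token frequencies, which is why the score is expected to become more reliable as training continues.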

Both BF and SB are open-source software packages available for free download and use. BF is available for Linux/Unix only and it can be integrated with other mail delivery and filtering tools to automatically tag and filter spam e-mail messages to a separate folder. SB is available for multiple operating system platforms and can be configured to work with Unix/Linux command line mail delivery and filtering tools, as well as POP3, IMAP, the Outlook e-mail client, etc. Detailed instructions are provided at http://bogofilter.sourceforge.net/ and http://spambayes.sourceforge.net/ respectively.

I have used both of these systems and would like to answer the question of which of the two is more effective at identifying spam.

The Bayesian technique suggests that with continued training the software packages should become more effective in spam identification. Thus, the first research question:

RQ1: Does the spam filtering effectiveness of BF and SB improve as the amount of e-mail messages used for training increases?

From my personal experience it appears that SB is more effective than BF, although after a good amount of training both seem to be very effective, in that I rarely get false positives with either implementation. Thus, the second research question:

RQ2: Is the spam filtering effectiveness of SB better than BF?

Full paper in pdf version.

Assumptions and limitations

For a more complete analysis of effectiveness, the experiment needs to be repeated with multiple corpora provided by different individuals, because patterns of e-mail use differ from one individual to another. The pattern of e-mail messages embedded in the set3000 corpus is defined by my personal e-mail communication as well as by e-mails I receive at aliases and forwarding e-mail addresses due to my involvement as moderator and administrator of various electronic news and discussion lists.

Additionally, the cap of 3000 messages in the corpus can be varied to see whether effectiveness changes, although I believe that 3000 messages tested for spam probability is a sufficiently large amount, comparable to real-life operational spam filtering systems.

The equal proportion of spam and good messages in the training sets might not resemble real-life situations. The rate of spam messages received is much higher than that of good messages, at least in my case. Accounting for variable proportions would improve this experiment. This might even yield an optimal proportion for a given spam cutoff level.
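
One way such a follow-up experiment could vary the proportion is sketched below. The function name, its parameters, and the fixed random seed are hypothetical and are not part of the original experiment; the sketch only shows how a training set with a chosen spam share might be drawn from the two halves of a corpus.

```python
import random


def sample_training_set(spam, good, size, spam_fraction, seed=0):
    """Draw `size` labelled messages, `spam_fraction` of them spam.

    Returns a shuffled list of (message, is_spam) pairs.
    """
    n_spam = round(size * spam_fraction)
    n_good = size - n_spam
    if n_spam > len(spam) or n_good > len(good):
        raise ValueError("corpus too small for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the draw repeatable
    batch = [(m, True) for m in rng.sample(spam, n_spam)]
    batch += [(m, False) for m in rng.sample(good, n_good)]
    rng.shuffle(batch)
    return batch


# Example: a 400-message training set that is 75% spam instead of 50%.
spam_corpus = [f"spam message {i}" for i in range(1500)]
good_corpus = [f"good message {i}" for i in range(1500)]
training = sample_training_set(spam_corpus, good_corpus, 400, 0.75)
```

Repeating the evaluation for several values of `spam_fraction` at the same spam cutoff would show whether some mix trains the filter better than the equal split.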

In this experiment, the issue of performance was not considered. Apart from effectiveness, if one of the systems is to be used for real-time spam filtering on a Unix/Linux server supporting thousands of users, SB might be at a disadvantage due to its implementation in the Python language. BF, on the other hand, is implemented in C and runs significantly faster.

Conclusion

Based on the above analyses, the following can be concluded:
• the spam filtering effectiveness of both SB and BF improves with the increased number of training messages
• at each training level SB is more effective than BF

Recommendation: In conjunction with the results in Figure 4 and Table 4, which show the number of FP, FN, TP, and TN at the 0.9 spam cutoff at different training levels, SB is more effective due to its significantly lower number of FN compared to BF. In order to minimize the training effort caused by false negatives and false positives, it is recommended that once SB is installed for use, it be trained with at least 200 or 400 messages (half good and half spam). At this stage, the number of FP is zero. However, spamming techniques change as fast as, or even faster than, spam filtering packages. This is an ongoing battle, and no software package can identify all spam messages.
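
For readers unfamiliar with the four counts, the sketch below shows how FP, FN, TP, and TN can be tallied at a given spam cutoff. It assumes that "positive" means "flagged as spam" and that a score equal to the cutoff counts as spam; the function name and the sample scores are made up for illustration and are not data from the experiment.

```python
def confusion_counts(scores, is_spam, spam_cutoff=0.9):
    """Tally TP, FP, TN, FN for spam scores against known labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, spam in zip(scores, is_spam):
        flagged = score >= spam_cutoff  # assumption: ties count as spam
        if flagged and spam:
            counts["TP"] += 1  # spam correctly caught
        elif flagged and not spam:
            counts["FP"] += 1  # good mail wrongly flagged as spam
        elif spam:
            counts["FN"] += 1  # spam that slipped through
        else:
            counts["TN"] += 1  # good mail correctly delivered
    return counts


# Made-up scores and labels, one pair per tested message.
scores = [0.95, 0.20, 0.91, 0.50, 0.05]
labels = [True, False, False, True, False]
result = confusion_counts(scores, labels)
```

Running the same tally after each training level gives the kind of per-level comparison on which the recommendation rests.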

For example, my installation of SB rarely produces false positives. But once in a while it does, especially when the pattern of spam messages changes all of a sudden or a virus with unique behavior appears on the scene. I also expect to receive false positives when subscribing to a new discussion list, more so if it is in a different language or on a topic different from the rest of the discussion lists I’m already subscribed to.

The battle against spam will continue as long as spammers have incentives to send spam messages. Spam filtering systems are indeed helpful, provided their false positives and false negatives are kept low. Both SB and BF seem to be designed to eliminate false positives with as little training as possible. In any case, due diligence and patience are needed on the part of the user. For better effectiveness the user should continuously train the system of choice. To aid in this process, both SB and BF allow for a good cutoff level in addition to the spam cutoff level. Messages with spam probabilities between the good cutoff and the spam cutoff can be filtered into a separate folder (usually called ‘unsure’) and then used to train the system appropriately.
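
The two-cutoff scheme can be sketched as a simple three-way decision. The 0.9 spam cutoff matches the level used in this study; the 0.2 good cutoff, the function name, and the handling of scores that fall exactly on a cutoff are assumptions made for illustration.

```python
def classify(score, good_cutoff=0.2, spam_cutoff=0.9):
    """Sort a spam probability into 'good', 'unsure', or 'spam'."""
    if not 0.0 <= good_cutoff < spam_cutoff <= 1.0:
        raise ValueError("need 0 <= good_cutoff < spam_cutoff <= 1")
    if score >= spam_cutoff:
        return "spam"
    if score <= good_cutoff:
        return "good"
    # Everything in between goes to the 'unsure' folder for manual review;
    # those messages are the most useful ones to train on.
    return "unsure"


# Route a few made-up scores to their folders.
folders = [classify(s) for s in (0.03, 0.55, 0.97)]
```

Widening the gap between the two cutoffs sends more mail to the 'unsure' folder, trading extra manual review for fewer outright misclassifications.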

By Mentor Cana, PhD
more info at LinkedIn
email: mcana {[at]} kmentor {[dot]} com

About this Archive

This page is an archive of entries from May 2004 listed from newest to oldest.
