Recently in Research Process Category

In response to 'Theories informing my research', I would like to bring attention to another concern: the empowering or restrictive effects that tacit and explicit theories have on an individual's way of thinking and research.

Sooner or later, many of us are guided by a set of theories, frameworks, and paradigms in our research work, some of them tacit and some explicit. They direct our research within the appropriate and relevant scholarly community, thus increasing the chances for scholarly collaboration and communication with like-minded folks.

However, the same theories, paradigms, and frameworks also limit our imagination and innovative thinking; they create the box within which we think and operate. Thus, they can have a potentially negative effect by filtering away problems and issues that merit scholarly scrutiny but are not scrutinized, because our mode of thinking does not allow them to reach us.

In this sense, the explicit theories and frameworks we subscribe to are perhaps less inhibiting to our ability to explore and innovate beyond our current interests. We are well aware of the explicit theories, we use them to conduct our research, and we can decide to go beyond them.

The tacit theories seem to be more inhibiting than the explicit ones. Because of their tacit nature, they direct our research in ways we might not be aware of, and thus we do not know how to go beyond them and expand our mode of thinking.

Certainly, there is a benefit to a structured way of thinking and research; awareness of it helps us position ourselves and our work within the relevant communities of practice. However, oftentimes the excessive structure in our way of thinking might be depriving us of the ability to see various phenomena with a new 'eye'.

How does one go about identifying and discovering his/her tacit theories, frameworks and paradigms?

(Originally published Nov 18, 2004)

The qualitative study (Scacchi, 2002) I have selected to critique was published in an electrical-engineering-oriented, peer-reviewed scholarly journal. The author is aware of his quantitatively oriented audience and thus from the very beginning sets the expectation that the study is “… not about hypothesis testing or testing the viability of a perspective software engineering methodology or notational form” (p. 24). Similarly to Lincoln and Guba (1985), who define naturalistic inquiry in terms of what it is not, Scacchi deems it necessary to define qualitative research in terms of what it is not: quantitative research. The tensions emerging from the struggle to present a non-quantitative study to an audience expecting quantitative work are pervasive throughout the article. Because of these tensions, and in an attempt not to alienate his audience, the author has either decided to take many shortcuts, evident in the lack of proper definition and utilization of qualitative methods, or is himself in the process of becoming familiar with various qualitative methods. In the rest of this paper I will concentrate on these struggles and attempts, and on what could have been done better, not forgetting that what the author has done may be a purposefully chosen middle ground, because the audience was not prepared for a full switch from quantitative to qualitative methodology and methods.

The core of this article is to understand the nature of, and the processes around, requirements for the development of open source software (Scacchi, p. 24). Since the open source development framework is a new approach to software development, the author rightfully suggests qualitative methods: “… investigation of the socio-technical processes, work practices and community forms found in the open source software development. The purpose of this investigation, over several years, is to develop narrative, semi-structured (i.e. hypertextual) and formal computational models of these processes, practices and community forms” (p. 24). The preceding quote also suggests a mixed-methods approach, where the findings of the qualitative part of the study (i.e. the ‘investigation’) would inform the quantitative part in building computational models. However, this article is restricted to the investigative part of the effort.

the battle against spam goes on: spambayes vs. bogofilter


Note: this entry is part of my class project on experiment and systems evaluation. Only the introduction, limitations, and conclusion sections are included here. For the full paper please take a look at the pdf version.


The purpose of this study is to evaluate the effectiveness of two spam filtering software packages in order to decide which one to recommend for further usage. Both Bogofilter (BF) and SpamBayes (SB) are based on the Bayesian probabilistic model (Baeza-Yates & Ribeiro-Neto, 1999, p. 48), as adapted and proposed by Graham (2002; 2003) for spam e-mail identification and filtering. The key to Graham’s Bayesian filtering technique is the ability to train the software with known spam and non-spam (i.e. good) e-mail messages on individual user basis. The idea is that with increased and continued training both packages will be more effective in identifying spam messages, while at the same time decreasing the number of false positives and false negatives.
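A minimal sketch of Graham-style Bayesian scoring may help illustrate the training idea. The per-token clamps, the 0.4 default for unseen tokens, and the doubling of good counts follow Graham (2002); the toy counts and messages below are entirely hypothetical:

```python
from math import prod

def token_prob(token, spam_counts, good_counts, n_spam, n_good):
    """Graham-style per-token spam probability with his suggested clamps."""
    g = 2 * good_counts.get(token, 0)   # good counts doubled to bias against false positives
    s = spam_counts.get(token, 0)
    if g + s < 1:
        return 0.4                      # default for tokens never seen in training
    p = (s / n_spam) / ((s / n_spam) + (g / n_good))
    return min(0.99, max(0.01, p))      # clamp away from 0 and 1

def message_prob(tokens, spam_counts, good_counts, n_spam, n_good, top=15):
    """Combine the most 'interesting' token probabilities (farthest from 0.5)."""
    probs = sorted({token_prob(t, spam_counts, good_counts, n_spam, n_good)
                    for t in tokens},
                   key=lambda q: abs(q - 0.5), reverse=True)[:top]
    num = prod(probs)
    return num / (num + prod(1 - q for q in probs))

# Tiny hypothetical training counts and a test message
spam_counts = {"viagra": 50, "free": 30, "meeting": 1}
good_counts = {"meeting": 40, "report": 25, "free": 5}
p = message_prob(["viagra", "free", "offer"], spam_counts, good_counts,
                 n_spam=100, n_good=100)
print(p)  # well above a 0.9 spam cutoff for this message
```

As more trained messages shift the counts, the per-token estimates sharpen, which is the mechanism behind both packages' improving effectiveness.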

Both BF and SB are open-source software packages available for free download and use. BF is available for Linux/Unix only, and it can be integrated with other mail delivery and filtering tools to automatically tag and filter spam e-mail messages into a separate folder. SB is available for multiple operating system platforms and can be configured to work with Unix/Linux command-line mail delivery and filtering tools, as well as POP3, IMAP, the Outlook e-mail client, etc. Detailed instructions are provided on the two projects' respective websites.

I have used both of these systems and would like to be able to answer the question of which of the two is more effective at identifying spam.

The Bayesian technique suggests that with continued training the software packages should become more effective in spam identification. Thus, the first research question:

RQ1: Does the spam filtering effectiveness of BF and SB improve as the amount of e-mail messages used for training increases?

From my personal experience, it appears that SB is more effective than BF, although after a good amount of training both SB and BF seem to be very effective, in that I rarely get false positives from either implementation. Thus, the second research question:

RQ2: Is the spam filtering effectiveness of SB better than BF?


Assumptions and limitations

For a more complete analysis of effectiveness, the experiment needs to be repeated with multiple corpora provided by different individuals, due to the uniqueness of and varying patterns in e-mail use between individuals. The pattern of e-mail messages embedded in the set3000 corpus is defined by my personal e-mail communication, as well as by e-mails I receive at aliases and forwarding addresses due to my involvement as moderator and administrator of various electronic news and discussion lists.

Additionally, the cap of 3000 messages in the corpus could be modified to check for any potential variability in effectiveness, although I believe that 3000 messages tested for spam probability is a sufficiently large amount, comparable to real-life operational spam filtering systems.

The equal proportion of spam and good messages in the training sets might not resemble real-life situations. The rate of spam messages received is much higher than that of good messages, at least in my case. Accounting for a variable proportion would improve this experiment; it might even yield an optimal proportion for a given spam cutoff level.

In this experiment, the issue of performance was not considered. Effectiveness aside, if one of the systems is to be used for real-time spam filtering on a Unix/Linux server supporting thousands of users, SB might be at a disadvantage due to its implementation in the Python language. BF, on the other hand, is implemented in C++ and runs significantly faster.


Based on the above analyses, the following can be concluded:
• the spam filtering effectiveness of both SB and BF improves with the increased number of training messages
• at each training level SB is more effective than BF

Recommendation: In conjunction with the results in Figure 4 and Table 4, showing the number of FP, FN, TP, and TN at the 0.9 spam cutoff at different training levels, SB is more effective due to its significantly lower number of FN compared to BF. In order to minimize the training effort caused by false negatives and false positives, it is recommended that once SB is installed, it should be trained with at least 200 to 400 (half good and half spam) messages. At this stage, the number of FP is zero. However, spamming techniques change as fast as (or even faster than) spam filtering packages. This is an ongoing battle, and no software package can identify all spam messages.
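The FP/FN/TP/TN counts used throughout this analysis can be tallied from scored messages with a few lines of code. This is a sketch: the `scored` list is hypothetical, and 'spam' is treated as the positive class at the 0.9 cutoff:

```python
def confusion(scored, spam_cutoff=0.9):
    """Tally TP, FP, TN, FN from (true_label, spam_probability) pairs
    at a single spam cutoff; 'spam' is the positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for label, prob in scored:
        predicted_spam = prob >= spam_cutoff
        if label == "spam":
            counts["TP" if predicted_spam else "FN"] += 1
        else:
            counts["FP" if predicted_spam else "TN"] += 1
    return counts

# Hypothetical probabilities for six messages
scored = [("spam", 0.97), ("spam", 0.55), ("good", 0.02),
          ("good", 0.91), ("spam", 0.99), ("good", 0.10)]
print(confusion(scored))  # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```

Repeating this tally at each training level is what produces a table like Table 4.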

For example, my installation of SB rarely produces false negatives. But once in a while it does, especially when the pattern of spam messages changes all of a sudden or a virus with unique behavior appears on the scene. I also expect to see false positives when subscribing to a new discussion list, more so if it is in a different language or on a topic of interest different from the rest of the discussion lists I'm already subscribed to.

The battle against spam will continue as long as spammers have incentives to send spam messages. Spam filtering systems are indeed helpful in reducing false positives and false negatives. Both SB and BF seem to be designed to eliminate false positives with as little training as possible. In any case, due diligence and patience are needed from the user. For better effectiveness, the user should continuously train the system of choice. To aid in this process, both SB and BF allow for a good cutoff level in addition to the spam cutoff level. Messages with spam probabilities between the good cutoff and the spam cutoff can be filtered into a separate folder (usually called 'unsure') and trained appropriately.
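The two-cutoff routing just described can be sketched in a few lines; the cutoff values here are illustrative, not either package's defaults:

```python
def route(prob, good_cutoff=0.2, spam_cutoff=0.9):
    """Route a message by its spam probability: below the good cutoff it is
    delivered normally, above the spam cutoff it goes to the spam folder,
    and anything in between lands in 'unsure' for manual training."""
    if prob >= spam_cutoff:
        return "spam"
    if prob <= good_cutoff:
        return "good"
    return "unsure"

print(route(0.95))  # spam
print(route(0.05))  # good
print(route(0.50))  # unsure
```

Training the messages from the 'unsure' folder is precisely what moves future borderline messages out of that middle band.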

actor-network theory or ANT ?


One of the major issues with the actor-network methodology is that there are no ready-to-use steps/procedures on how to go about operationalizing the various actor-network related concepts. Many of the concepts are dispersed among the writings of Latour, Callon, Law, Bijker, Akrich, Hassard, and a few other authors. One of the most informative sources is the book "Actor Network Theory and After" by Law & Hassard.

As actor-network theory and methodology got translated into ANT (interestingly, a theory and methodology becoming the subject of its own theorization through the concepts of translation and inscription), many researchers have made their own particular attempts to operationalize the concepts relevant to their line of inquiry.

The point I'm trying to make is that we have bits and pieces of attempts to operationalize various actor-network related concepts; however, we lack an overall framework. The answer to why this is so is pretty much provided in the above-mentioned book, in the chapter "On recalling ANT" (by Latour), which states that actor-network was only meant to be a way of doing ethnomethodology and not a theory (p. 19). So, when people talk of ANT, they usually mean the theorizing of actor-network in various forms and flavors, while actor-network itself is more of a way of doing ethnomethodology.

Latour makes the argument that the acronym ANT is not simply an acronym. Rather, it is the result of the process of translation by which actor-network theory and methodology became ANT (in various flavors). The process of translation thus produced multiple ANTs, each stressing different concepts related to the actor-network methodology/theory.

So, as a result, it would seem that ANT has different meanings depending on the context and the line of inquiry to which it is applied. The process of translation is given as the reason.

Latour explains this very clearly in the chapter "On recalling ANT".

What is Logistic Regression?
“Logistic regression allows one to predict a discrete outcome such as group membership from a set of variables that may be continuous, discrete, dichotomous, or a mix.” (Tabachnick and Fidell, 1996, p. 575)

What is Discriminant Analysis?
“The goal of the discriminant function analysis is to predict group membership from a set of predictors” (Tabachnick and Fidell, 1996, p. 507)

When/How to use Logistic Regression and Discriminant Analysis?
From the above definitions, it appears that the same research questions can be answered by both methods. Logistic regression may be better suited for cases where the dependent variable is dichotomous, such as Yes/No, Pass/Fail, Healthy/Ill, Life/Death, etc., while the independent variables can be nominal, ordinal, interval, or ratio. Discriminant analysis might be better suited when the dependent variable has more than two groups/categories. However, the real difference in determining which one to use depends on the assumptions regarding the distribution of and relationships among the independent variables, and the distribution of the dependent variable.

So, what is the difference?
Well, for both methods the categories of the outcome (i.e. the dependent variable) must be mutually exclusive. One way to determine whether to use logistic regression or discriminant analysis when there are more than two groups in the dependent variable is to analyze the assumptions pertinent to both methods. Logistic regression is much more relaxed and flexible in its assumptions than discriminant analysis. Unlike discriminant analysis, logistic regression does not require the independent variables to be normally distributed, linearly related, or of equal variance within each group (Tabachnick and Fidell, 1996, p. 575). Being free from the assumptions of discriminant analysis makes logistic regression usable in many situations. However, “when [the] assumptions regarding the distribution of predictors are met, discriminant function analysis may be more powerful and efficient analytic strategy” (Tabachnick and Fidell, 1996, p. 579).

Even though logistic regression has few assumptions, and is thus usable in more instances, it does require a larger sample size: at least 50 cases per independent variable might be required for accurate hypothesis testing, especially when the dependent variable has many groups (Grimm and Yarnold, 1995, p. 221). However, given the same sample size, if the assumptions of multivariate normality of the independent variables within each group of the dependent variable are met, and each category has the same variance and covariance for the predictors, discriminant analysis might provide more accurate classification and hypothesis testing (Grimm and Yarnold, 1995, p. 241). The rule of thumb, though, is to use logistic regression when the dependent variable is dichotomous and there are enough samples. [194:604]
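To make the dichotomous-outcome case concrete, here is a minimal sketch of a one-predictor logistic regression fit by gradient descent; the pass/fail data, learning rate, and epoch count are all invented for illustration:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit intercept b0 and slope b1 of a one-predictor logistic
    regression by per-sample gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            b0 += lr * (y - p)          # gradient of log-likelihood w.r.t. b0
            b1 += lr * (y - p) * x      # gradient w.r.t. b1
    return b0, b1

# Toy dichotomous outcome: hours studied -> passed (1) or failed (0)
hours  = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0,   0,   0,   0,   1,   1,   1,   1]
b0, b1 = fit_logistic(hours, passed)
p_3h = sigmoid(b0 + b1 * 3.0)   # predicted probability of passing after 3 hours
print(round(p_3h, 3))
```

Note that nothing here assumes normality or equal variance of the predictor within each group, which is exactly the flexibility discussed above.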

Grimm, L.G. & Yarnold, P.R. (Eds.). (1995). Reading and Understanding Multivariate Statistics. Washington, D.C.: American Psychological Association.

Tabachnick, B.G. & Fidell, L.S. (1996). Using Multivariate Statistics. New York: HarperCollins.

Rice Virtual Lab in Statistics



"An online statistics book with links to other statistics resources on the web."

The most relevant aspect of the engineering courses (my background) is the emphasis on the systems mode of thinking which has helped me tremendously in my present course of study here at SCILS, especially in Information Science.

So far, the challenge has been to build a frame of reference, or a mindset, through which one is able to see the problems related to information science and the resolutions proposed to resolve them. Personally, I believe that the systems way of thinking is a very insightful and powerful tool, especially because it helps you study a problem by identifying the boundaries around it, its scope, what happens within the boundaries, and how the issues of the problem at hand interface with the environment (i.e. with whatever lies outside the relevantly defined boundary).

Another challenge for me was adjusting to the statistical methods used in social research. Despite the obvious difference between the statistical results of technical systems and those describing the relationship between independent and dependent variables in social phenomena, the statistics background from my engineering courses has helped me identify the connection between statistical analysis of engineering data and data gathered from information science experiments. Another benefit of engineering statistics courses is the ability they provide to better understand the fundamental background of particular statistical tools, given that courses dealing with statistics for social research emphasize mostly the usability and applicability of statistics, and do not necessarily stress the actual derivation of the statistical tools and procedures.

The concepts of interconnectivity of the various technical elements within information and communication systems, and the multitude of services they carry, relate almost directly (albeit at a different level of application) to various practical communication tools and services that affect the social realm. An information and communication system is not a goal in its own right; it is produced and used within a social web of interactions composed of human and non-human entities, or networked actors, as suggested by actor-network theory (ANT) and the actor-network methodology. Given that actor-network theory considers human and non-human entities/elements in its analysis and methodology, it would be interesting to identify and describe a possible link between the variations and changes at the lowest level of interaction (i.e. technological) and their potential effect on the interaction between a system as a whole and its user(s).

Through these few reflections, I have attempted to link the experience and knowledge I have obtained from my engineering education and systems analyst/engineering experience with the role they have played so far in my PhD-level classes in Information Science. I hope to have more of these sorts of reflections in the future, as they pop up in my head. :)

more things to learn in the new semester


I've just created two new categories: 1) Quantitative Research Methods (for class 194:604), and 2) Mass Communication Theory and Research (for class 194:631).

In these two categories I'll be posting comments, ideas, thoughts, and reflections, pertinent to the two classes I'm taking this semester (Fall 2003).

It would be nice to hear if other bloggers are taking similar classes so we can exchange ideas and thoughts, and help each other. :) So far I've identified Edward Bilodeau who will be taking both Qualitative and Quantitative Research classes this semester.

Update (2/1/2004):
I've renamed the above category, Quantitative Research Methods, to Research Methods, Methodologies, Issues, in order to better reflect my target interests. In this new category I'll be writing about research in general as it pertains to my dissertation interests (for now), and not necessarily only about quantitative research methods and methodologies.

By Mentor Cana, PhD
more info at LinkedIn
email: mcana {[at]} kmentor {[dot]} com

About this Archive

This page is an archive of recent entries in the Research Process category.
