Wednesday, 8 July 2009

User Manual to run the Spam filter

Dear Reader, this is the user manual for running the source code. Please remember to cite the code if you use it.

User Manual for Windows Users (Bayesian)

1. Go to the source directory of the Bayesian filter.

2. Open the workspace.

3. Right-click each project and build it.

4. In Visual Studio, set the project's command arguments for training as follows:

[directory of ham files] [directory of spam files] [parser type]

5. For classification, set the arguments to: [directory of the spam message] [directory of the word-level probabilities] [parser type]

User Manual for Windows Users (SVM)

1. Go to the source directory of the SVM filter.

2. Open the workspace.

3. Right-click each project and build it.

4. For feature extraction, set the project's command arguments as follows:

[directory of spam files] [directory of ham files]

5. For training on the extracted examples, set the arguments as follows: [path to training.dat] [model.dat]

6. For classification, set the arguments as follows:

[directory of the spam message] [path to model.dat] [path to Word_count.dat]

User Commands for Linux Users (Bayesian)

1) Open the Applications menu in the Ubuntu GUI and select Terminal.

2) In the terminal, use the cd command to change to the Bayesian directory.

3) To compile the source code, type make clean and then type make. The source is now compiled and the binaries are generated in the bin directory.

4) To train on the dataset, go to the spamfiltering directory and type ./bin/trainer ../spam_files ../ham_files WORD

5) To classify a message, type ./bin/classifier

User Commands for Linux Users (SVM)

1. Open the Applications menu in the Ubuntu GUI and select Terminal.

2. In the terminal, change the directory to SVM_spam by typing cd Desktop/SVM_spam

3. To compile the source code, type make clean and then type make. The source is now compiled and the binaries are generated in the bin directory.

4. To run the feature extractor, type ./bin/feature_extractor ../spam_files ../ham_files

5. To run svm_learn, type ./bin/svm_learn ../training.dat model.data

6. Then type cat training.dat to display the contents of the training file.

Support Vector Machine Architecture

Introduction to SVM

A support vector machine (SVM) is a computer algorithm that learns from a number of training examples to assign labels to objects (William, 2006). SVMs comprise classification and regression algorithms developed by Vapnik, and they are gaining popularity thanks to many attractive features and promising empirical performance. For instance, an SVM can be used in game development by clustering graphics into 3D graphics. Alternatively, an SVM can recognize handwritten digits by examining a large collection of scanned images of handwritten zeroes, ones and so forth (William, 2006).

SVM algorithms are often based on the Structural Risk Minimization (SRM) principle from statistical learning theory. The role of SRM is to find an optimal hyperplane for which the lowest true error can be guaranteed. This framework yields a learning algorithm that, when trained from a finite data set, gives a bound on the ‘true’ performance obtained in practice.
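To make the idea of a separating hyperplane concrete, here is a minimal sketch of how a trained linear SVM labels one example. It is only an illustration and not the code of this project: the weights, the bias and the feature values are made-up numbers, and the function names are hypothetical.

```python
def decision_value(weights, bias, features):
    """Signed score w . x + b; zero means the point lies on the hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Return +1 for one side of the hyperplane and -1 for the other."""
    return 1 if decision_value(weights, bias, features) >= 0 else -1


def hinge_loss(weights, bias, features, label):
    """Penalty a soft-margin SVM minimises during training; label is +1 or -1."""
    return max(0.0, 1.0 - label * decision_value(weights, bias, features))


# Made-up model and example, for illustration only.
weights, bias = [2.0, -1.0], -0.5
example = [1.0, 0.5]
print(classify(weights, bias, example))
print(hinge_loss(weights, bias, example, -1))
```

The loss is zero when the example is on the correct side with a margin of at least one, and it grows as the example moves to the wrong side.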

For a detailed explanation of how SVMs work, you can download the report I wrote for my Final Year Project at http://rapidshare.com/files/253305852/SVM.docx.html. Please cite it as a reference if you use it for future research and development. Thank you.



Tuesday, 7 July 2009

Bayesian filtering architecture



How does Bayesian filtering work?

First of all, the email needs to be tokenized by separating the message in the email body into small parts. After the message has been tokenized, the next step is to map the tokens into the dictionary table, which is also known as the frequency table. In this frequency table, the number of occurrences of each word is recorded. Then the probability that the email is spam is calculated using Bayes' theorem, based on whether each word or token is associated with spam or non-spam.
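The tokenizing and frequency-table steps can be sketched as follows. This is a minimal sketch and not the project's own parser: the token pattern, the function names and the choice to count each token once per message are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(message):
    """Split a message body into lowercase word tokens."""
    return re.findall(r"[a-z0-9$']+", message.lower())


def build_frequency_table(messages):
    """Count in how many messages each token occurs (assumed counting rule)."""
    table = Counter()
    for message in messages:
        table.update(set(tokenize(message)))
    return table


# Tiny made-up corpora, for illustration only.
spam_table = build_frequency_table(["Win cash now", "Cheap cash offer"])
ham_table = build_frequency_table(["Meeting moved to noon", "Lunch at noon?"])
print(spam_table["cash"], ham_table["noon"])
```

One table is kept for spam and one for legitimate mail, so that each token ends up with a spam count and a non-spam count.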


The final step is to adjust the token values in the dictionary, for example by setting a threshold level and removing less frequent items. This step improves filtering when a binary decision on the message is needed. When a binary result is not required, the filter can still report the probability that a bulk mail is spam. This probability can be used in many ways, but most Bayesian filters implemented today follow this rule: messages with a spam probability under 0.5 are judged as non-spam, while messages with a probability from 0.5 to 1 are judged as possible spam.
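A minimal sketch of these two ideas, pruning rare tokens and turning a score into a decision, is given here. It assumes that the spam score is a probability between 0 and 1 and that the cut-off is 0.5; the function names and the minimum count are hypothetical.

```python
def prune_rare_tokens(table, min_count):
    """Drop tokens seen fewer than min_count times (assumed pruning rule)."""
    return {token: n for token, n in table.items() if n >= min_count}


def judge(spam_probability, threshold=0.5):
    """Turn a spam probability in [0, 1] into a binary decision."""
    return "possible spam" if spam_probability >= threshold else "non-spam"


# Made-up counts, for illustration only.
table = {"cash": 12, "noon": 1, "offer": 4}
print(prune_rare_tokens(table, min_count=2))
print(judge(0.2), "/", judge(0.7))
```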

How to extract messages using N-Grams?

In the N-Gram extraction approach, frequent tokens, such as phrases of N words, are extracted for use in corpus training. Let (g1, g2, ..., gL) be the ordered list (in decreasing frequency) of the most frequent n-grams of the training corpus. Then each message is represented as a vector of length L, <x1, x2, ..., xL>, where xi depends on gi. Two text representation approaches are used in the N-Gram process:

1. Binary: The value of xi is one if gi is included at least once in the message, or zero if gi is not included in the message.

2. Term Frequency (TF): The value of xi corresponds to the frequency of occurrence (normalized by the message length) of gi in the message.
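The two representations can be sketched as follows. This is a rough sketch under stated assumptions: n-grams are taken over whitespace-separated words, the message length used for normalization is the number of n-grams in the message, and all function names are hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(corpus, n, L):
    """The L most frequent n-grams of the corpus, in decreasing frequency."""
    counts = Counter()
    for message in corpus:
        counts.update(word_ngrams(message.lower().split(), n))
    return [gram for gram, _ in counts.most_common(L)]


def binary_vector(message, grams, n):
    """x_i is 1 if g_i occurs at least once in the message, otherwise 0."""
    present = set(word_ngrams(message.lower().split(), n))
    return [1 if gram in present else 0 for gram in grams]


def tf_vector(message, grams, n):
    """x_i is the count of g_i divided by the number of n-grams in the message."""
    found = word_ngrams(message.lower().split(), n)
    counts = Counter(found)
    length = max(len(found), 1)  # avoid dividing by zero on very short messages
    return [counts[gram] / length for gram in grams]


# Made-up corpus, for illustration only.
corpus = ["buy cheap pills now", "cheap pills cheap pills", "see you at lunch"]
grams = top_ngrams(corpus, n=2, L=1)
print(grams)
print(binary_vector("cheap pills now", grams, 2))
print(tf_vector("cheap pills cheap pills", grams, 2))
```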

How to test the filter?

When the filter is used for testing, the probability is first calculated as described in Figure 1 below, and the records in the token dictionary are then modified according to the result. The value is initialized to one (to cover the case where none of the words match the token dictionary). ALL (the number of all e-mails) = SPAM + HAM (the number of spam letters plus the number of legitimate letters).

-----------------------------------------------------------------------------------------------------------------

- Let us call a word a “matching word” if it exists both in the letter and in the token dictionary.

- P (“matching words” | “letter is spam”) = product over all matched words of (N1 value of the current word / SPAM).

- P (“matching words” | “letter is legitimate”) = product over all matched words of (N2 value of the current word / HAM).

- P (“letter is spam”) = SPAM / ALL.

- P (“letter is legitimate”) = HAM / ALL.

- P (“letter is spam” | “matching words”) = P (“letter is spam”) * P (“matching words” | “letter is spam”)

- P (“letter is legitimate” | “matching words”) = P (“letter is legitimate”) * P (“matching words” | “letter is legitimate”)

(The common factor P (“matching words”) is omitted from both, because it cancels in the final result.)

- Final result: P (“letter is spam” | “matching words”) / P (“letter is legitimate” | “matching words”)

Figure 1: Calculation of the probability

--------------------------------------------------------------------------------------------------------------------------------------
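The calculation in Figure 1 can be sketched in a few lines. This is a minimal sketch and not the downloadable trainer: it assumes that N1 and N2 are the spam and legitimate counts of a token, that the legitimate likelihood is divided by HAM, and that each class score is its own prior times its own likelihood, as Bayes' theorem requires. The function name and the dictionary layout are hypothetical.

```python
def spam_ratio(message_tokens, dictionary, spam_count, ham_count):
    """Ratio P(spam | matching words) / P(legitimate | matching words).

    dictionary maps a token to an assumed pair (N1, N2) of spam and ham counts.
    """
    total = spam_count + ham_count
    # Both likelihoods start at one, which covers the case of no matching words.
    p_words_given_spam = 1.0
    p_words_given_ham = 1.0
    for token in set(message_tokens):
        if token not in dictionary:
            continue  # only "matching words" contribute
        n1, n2 = dictionary[token]
        p_words_given_spam *= n1 / spam_count
        p_words_given_ham *= n2 / ham_count
    score_spam = (spam_count / total) * p_words_given_spam
    score_ham = (ham_count / total) * p_words_given_ham
    if score_ham == 0:
        return float("inf")  # a matching word was never seen in legitimate mail
    return score_spam / score_ham


# Made-up dictionary and counts, for illustration only.
dictionary = {"cash": (8, 1), "meeting": (1, 6)}
print(spam_ratio(["cash", "offer"], dictionary, spam_count=10, ham_count=10))
print(spam_ratio(["hello"], dictionary, spam_count=10, ham_count=10))
```

A ratio above one means the letter looks more like spam than like legitimate mail. With no matching words the ratio reduces to SPAM / HAM, the ratio of the two priors.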

The source code for training can be downloaded from http://rapidshare.com/files/253317113/Spam_Filt.rar.html