Bioinformatics techniques for spam detection

This is not a new topic: IBM has found that many of the pattern detection and analysis techniques used in bioinformatics can be applied in other fields as well.

I am adding this because bioinformatics and microarrays are growing in popularity, so I decided to give a few bytes to such articles as well.

Many of these studies are based on homology detection. Going forward, the techniques used for SNP detection in SNP microarrays might also find use in other fields, notably spam detection, share market analysis and trend analysis.

I found some presentations on the web, and from IBM, on using the famous Teiresias algorithm for spam detection.

Chung-Kwei applies advanced pattern matching algorithms developed in IBM’s bioinformatics group to spam detection. This new classification algorithm can detect complex patterns in messages that go beyond the simple word or word phrases used in most algorithms.

A technique originally designed to analyse DNA sequences is the latest weapon in the war against spam. An algorithm named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits) can catch nearly 97 per cent of spam.

Chung-Kwei is based on the Teiresias algorithm, developed by the bioinformatics research group at IBM’s Thomas J Watson Research Center in New York, US. Teiresias was designed to search different DNA and amino acid sequences for recurring patterns, which often indicate genetic structures that have an important role.

Instead of chains of characters representing DNA sequences, the research group fed the algorithm 65,000 examples of known spam. Each email was treated as a long, DNA-like chain of characters. Teiresias identified six million recurring patterns in this collection, such as “Viagra”.

Each pattern represented a common sequence of letters and numbers that had appeared in more than one unsolicited message. The researchers then ran a collection of known non-spam (dubbed “ham”) through the same process, and removed the patterns that occurred in both groups.
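The train-then-subtract step described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it uses fixed-length character n-grams as a stand-in for Teiresias's variable-length pattern discovery, and the function names, the n-gram length and the toy messages are all invented here rather than taken from IBM's implementation.

```python
from collections import Counter


def ngrams(text, n):
    """Return the set of distinct character n-grams in one message."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def recurring_patterns(messages, n=6, min_support=2):
    """Patterns that appear in at least `min_support` different messages."""
    support = Counter()
    for message in messages:
        # Each message contributes at most once per pattern.
        support.update(ngrams(message.lower(), n))
    return {p for p, count in support.items() if count >= min_support}


def spam_only_patterns(spam, ham, n=6):
    """Recurring spam patterns, minus anything that also occurs in ham."""
    ham_patterns = set()
    for message in ham:
        ham_patterns |= ngrams(message.lower(), n)
    return recurring_patterns(spam, n) - ham_patterns


# Toy data, invented for illustration only.
spam = ["buy viagra now", "cheap viagra here", "viagra for less"]
ham = ["lunch at noon?", "meeting notes attached"]
patterns = spam_only_patterns(spam, ham)
```

With this toy data, a pattern such as "viagra" survives because it recurs across spam messages and never appears in the ham set, while a fragment seen in only one spam message is dropped.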

Incoming email was given a score based on how many spam patterns it had. A long email that only had a few spammy sentences would get a relatively low score, but one with many patterns spread across the length of the message would score much higher. Chung-Kwei correctly identified 64,665 of 66,697 test messages as being spam, or 96.56 per cent. More importantly, its rate of misidentifying genuine email as spam was just 1 in 6000 messages. Losing a single genuine email in a torrent of spam is a greater failing in a filter than letting the occasional spam email through.
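The article does not give the actual scoring formula, so the following Python sketch is an assumption: it scores a message by the fraction of its characters covered by known spam patterns, which reproduces the behaviour described (a long message with a few spammy fragments scores low, a message saturated with patterns scores high). The function name and the sample patterns are hypothetical.

```python
def coverage_score(message, patterns):
    """Fraction of characters in `message` covered by any spam pattern.

    Assumed scoring rule for illustration; not Chung-Kwei's real formula.
    """
    text = message.lower()
    if not text:
        return 0.0
    covered = [False] * len(text)
    for pattern in patterns:
        if not pattern:
            continue  # an empty pattern covers nothing
        start = text.find(pattern)
        while start != -1:
            for i in range(start, start + len(pattern)):
                covered[i] = True
            # Step by one so overlapping occurrences are also found.
            start = text.find(pattern, start + 1)
    return sum(covered) / len(text)


# Hypothetical patterns and messages, invented for illustration.
patterns = {"viagra", "cheap "}
spammy = "cheap viagra"
mostly_clean = "viagra" + " was mentioned once in this otherwise ordinary long email"
```

Calling `coverage_score(spammy, patterns)` gives a much higher value than `coverage_score(mostly_clean, patterns)`; a real filter would then compare the score against a threshold tuned to keep false positives rare.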

Chung-Kwei deals with common spammer strategies for dodging pattern-recognition schemes, such as replacing the s with a $, as in “increa$e your $ex power”, using its built-in tolerance for different, but functionally equivalent, DNA sequences. Just as, in genetic analysis, Teiresias could be taught that the CCC and CCU codons both produce the same amino acid, proline, the anti-spam system can be trained to accept $ and s as identical.
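One simple way to approximate this tolerance is to map equivalent characters to a single canonical form before matching, much as synonymous codons map to one amino acid. The Python sketch below does that; note that Teiresias handles equivalence inside the pattern discovery itself, which this does not reproduce, and that apart from "$" for "s" and the two proline codons the table entries are illustrative guesses.

```python
# Hypothetical equivalence table. Only "$" -> "s" comes from the article;
# the other entries are guesses added for illustration.
EQUIVALENTS = str.maketrans({"$": "s", "@": "a", "0": "o"})

# The biological analogy: different codons, same amino acid.
CODON_TO_AMINO_ACID = {"CCC": "proline", "CCU": "proline"}


def normalise(text):
    """Lower-case the text and collapse equivalent characters."""
    return text.lower().translate(EQUIVALENTS)


def contains_pattern(message, pattern):
    """Match a pattern against a message after normalising both."""
    return normalise(pattern) in normalise(message)
```

For example, `contains_pattern("increa$e your $ex power", "increase")` is true because both sides are normalised before comparison, so a pattern learned from plain text still fires on the obfuscated variant.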

IBM intends to include Chung-Kwei in its commercial product, SpamGuru. Justin Mason, who developed SpamAssassin, one of the most popular open-source anti-spam filters, says that Chung-Kwei looks promising.

 

 

 


A little bit of fun: “I did a PhD and did NOT go mad” by Richard Butterworth of Middlesex University

[Image: cartoon of a person looking mad]

I did a PhD and did NOT go mad

Before reading these wise words advising you how to do a PhD (inspired by three years of the author carefully and diligently banging his head on a table) you are requested to read and digest the following irony…

The only way to find out how to do a PhD is to do one. Therefore all advice is useless.

To say that I enjoyed doing my PhD would be a lie, not just an ordinary lie mind you. More the sort of lie one would normally associate with Tory party conferences. A big wobbly lie with a dusting of sugar on top. At times I hated my PhD, so why do I have any authority to give advice on doing a PhD? Well, I don’t claim to have any, other than the fact that I completed and passed the thing, so I must have done something right.

Read on at http://www.cs.mdx.ac.uk/staffpages/richardb/PhDtalk.html

Standardization in the microarray analysis software industry

Scouting for the right microarray analysis software kept me thinking: why, despite this software being used by scores of scientists, has no one come forward to create what could be called a standard for it? Confusion reigns in this field, as data from one company’s software does not work with another’s, and vice versa. For an industry like biology and drug discovery, which is trying to benefit from knowledge of mathematics, statistics, chemistry and physics, the inability to port data across platforms is a serious roadblock. There are standards such as MIAME and MAGE, but these are just data standards, not standards for the software itself. I believe there should be something similar to the ISO standards, SEI CMMI, etc.

The majority of newsgroups and forums are used by graduate students, and at times senior researchers, to find out which software is best to use. I thought of starting a wiki page where researchers can post their comments, rate the products and compare their features against each other.

