Scholarly publishers throw out Microsoft

After PLOS and Nature, it is Microsoft’s turn. Life science researchers are in no mood to relent to industry’s interests.

Microsoft’s latest Word release has caused chaos in scholarly publishing circles. Submit a paper to the journal Nature in Word 2007, and you will face the following warning:

‘We currently cannot accept files saved in Microsoft Office 2007 formats. Equations and special characters cannot be edited and are incompatible with Nature’s own editing and typesetting programs.’

And it’s not just Nature. Try Science, The Lancet and pretty much any ‘mathematics-intensive’ journal in the world and you will hit the same problem.

Science and Nature will no longer accept manuscripts written in Microsoft’s Office 2007 suite, because the latest version of Word is no longer compatible with Mathematical Markup Language (MathML), the de facto standard for writing equations in text documents, according to recent notices posted on the websites of both journals. In Office 2007, Microsoft’s own Office MathML (OMML) is used for equations instead.

And it doesn’t end there. Microsoft, Sun and the open-source community are up in arms against each other over the adoption of an open document format. Microsoft supports OOXML, while Sun supports ODF (Open Document Format), promoted by the ODF Alliance, which also enjoys widespread support from academia and from corporations such as Oracle, IBM, Red Hat, Sun Microsystems and Google.

India’s 21-member technical committee decided that India will vote ‘no’ on Microsoft’s Office Open XML (OOXML) standard at the International Organization for Standardization (ISO) in Geneva on September 2.

Online Data sharing for scientists

Brent Edwards, director of the Starkey Hearing Research Center in Berkeley, California, who blogs on innovation in science, has written about a Nature article on online data sharing. He comments on the potential of new online data-sharing sites such as Swivel and IBM’s Many Eyes. According to the Nature report, some scientists are already using these new tools to share sequence and microarray data. The potential value from scientists openly sharing their data is huge, possibly akin to the value provided by open-source software development.

Once data are uploaded to these sites (which are still being tested), people can reanalyse the numbers, mix them with other data and visualize them in different ways. Swivel focuses on letting users combine data sets, with some basic ways to present the results such as scatter graphs and bar charts. Many Eyes allows users to generate more complicated graphs such as network diagrams, which depict nodes and connections within networks, and treemaps, which display data as groups of nested rectangles.

Despite the many software solutions at their disposal, many scientists still write their own code for bioinformatics and statistical analysis. Perhaps the next frontier that might help the community could be the development of Firefox-like software that offers some basic functions free of cost, with additional functions that can be bought, or acquired free of cost as add-ons from other researchers. Such a move would benefit researchers and students alike.

There are certainly many more data-sharing websites, such as http://www.gotomyfiles.com, http://www.xdrive.com and http://www.ibackup.com, but these are more data storage sites and do not offer the level of document collaboration features required by a life science researcher.

Then there are a few other sites, such as Microsoft’s FolderShare, and others that offer features such as remote PC access; GoToMyPC, VNC and WebEx are a few examples from this stable. Some of these, such as FolderShare, can even bypass a firewall and can pose serious security risks to data and PCs if handled improperly.

Bioinformatics Techniques for spam detection

It’s not a new topic: IBM has discovered that many of the pattern detection and analysis techniques used in bioinformatics can be used in other fields as well.

I thought of adding this because bioinformatics and microarrays are growing in popularity, so I decided to give a few bytes to such articles as well.

Many of these studies are based on homology detection. Going forward, the techniques used for SNP detection in SNP microarrays might also find use in other fields, notably spam detection and stock market or trend analysis.

I found some presentations on the web, and from IBM, on using the famous Teiresias algorithm for spam detection.

Chung-Kwei applies advanced pattern matching algorithms developed in IBM’s bioinformatics group to spam detection. This new classification algorithm can detect complex patterns in messages that go beyond the simple word or word phrases used in most algorithms.

A technique originally designed to analyse DNA sequences is the latest weapon in the war against spam. An algorithm named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits) can catch nearly 97 per cent of spam.

Chung-Kwei is based on the Teiresias algorithm, developed by the bioinformatics research group at IBM’s Thomas J Watson Research Center in New York, US. Teiresias was designed to search different DNA and amino acid sequences for recurring patterns, which often indicate genetic structures that have an important role.

Instead of chains of characters representing DNA sequences, the research group fed the algorithm 65,000 examples of known spam. Each email was treated as a long, DNA-like chain of characters. Teiresias identified six million recurring patterns in this collection, such as “Viagra”.

Each pattern represented a common sequence of letters and numbers that had appeared in more than one unsolicited message. The researchers then ran a collection of known non-spam (dubbed “ham”) through the same process, and removed the patterns that occurred in both groups.
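To make the idea concrete, here is a rough sketch of the "find recurring patterns in spam, then drop the ones that also show up in ham" step. It is not Teiresias: as a stand-in it simply counts fixed-length character substrings, and the function name `recurring_patterns`, the pattern length and the sample messages are all made up for illustration.

```python
from collections import Counter

def recurring_patterns(messages, length=4, min_messages=2):
    """Return substrings of `length` that occur in at least `min_messages` messages.

    Stand-in for Teiresias (an assumption): plain fixed-length substrings,
    not the variable-length patterns the real algorithm discovers.
    """
    counts = Counter()
    for text in messages:
        # A set, so each message counts a given pattern at most once.
        grams = {text[i:i + length] for i in range(len(text) - length + 1)}
        counts.update(grams)
    return {gram for gram, n in counts.items() if n >= min_messages}

# Hypothetical toy corpora.
spam = ["buy viagra today", "cheap viagra today"]
ham = ["meeting moved to today", "see you at noon"]

spam_patterns = recurring_patterns(spam)               # seen in more than one spam
ham_patterns = recurring_patterns(ham, min_messages=1) # anything seen in ham
spam_only = spam_patterns - ham_patterns               # drop patterns shared with ham
```

In this toy run a pattern like "viag" survives, while "oday" is thrown away because it also appears in genuine mail.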

Incoming email was given a score based on how many spam patterns it had. A long email that only had a few spammy sentences would get a relatively low score; but one with many patterns spread across the length of the message would score much higher. Chung-Kwei correctly identified 64,665 of 66,697 test messages as being spam, or 96.56 per cent. More importantly, its rate of misidentifying genuine email as spam was just 1 in 6000 messages. Losing a single email in a torrent of spam is a greater failing in a filter than letting the occasional spam email through.
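The scoring idea can be sketched roughly as follows. The article does not give the actual formula, so this is only one plausible reading: score a message by the fraction of its characters covered by known spam patterns, so that a long message with a few spammy words scores low. The function name `spam_score` and the example patterns are assumptions.

```python
def spam_score(text, patterns):
    """Fraction of characters in `text` covered by at least one spam pattern.

    Assumed scoring rule for illustration; not Chung-Kwei's published formula.
    """
    if not text:
        return 0.0
    covered = [False] * len(text)
    for pattern in patterns:
        start = text.find(pattern)
        while start != -1:
            # Mark every character of this occurrence as covered.
            for i in range(start, start + len(pattern)):
                covered[i] = True
            start = text.find(pattern, start + 1)
    return sum(covered) / len(text)

# Hypothetical spam-only patterns and messages.
patterns = {"viagra", "cheap"}
short_spam = "cheap viagra"
long_ham = "hi bob, the quarterly report is attached. ps: someone mentioned cheap flights"

short_score = spam_score(short_spam, patterns)
long_score = spam_score(long_ham, patterns)
```

Here the short spammy message is almost fully covered, while the long genuine one only picks up a small score from the single word "cheap".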

Chung-Kwei deals with common spammer strategies to dodge pattern-recognition schemes, such as replacing the s with a $, as in “increa$e your $ex power”, using its built-in tolerance for different, but functionally equivalent, DNA sequences. Just as, in genetic analysis, Teiresias could be taught that the CCC and CCU codons both produce the same amino acid, proline, the anti-spam system can be trained to accept $ and s as identical.
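One simple way to picture this equivalence trick is to map look-alike characters onto a single representative before matching, much like mapping several codons onto one amino acid. The sketch below is only an illustration of that idea, not how Chung-Kwei implements it; only the $/s pair comes from the article, and the other substitutions and the name `normalise` are assumptions.

```python
# Codon analogy from the text: different codons, same amino acid.
CODON_TO_AMINO_ACID = {"CCC": "proline", "CCU": "proline"}

# Character equivalences for email text. Only "$" -> "s" is from the
# article; "@" -> "a" and "0" -> "o" are hypothetical extras.
EQUIVALENTS = str.maketrans({"$": "s", "@": "a", "0": "o"})

def normalise(text):
    """Lower-case the text and collapse look-alike characters."""
    return text.lower().translate(EQUIVALENTS)

cleaned = normalise("Increa$e your $ex power")
```

After normalisation, the disguised phrase matches the same patterns as its plain spelling.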

IBM intends to include Chung-Kwei in its commercial product, SpamGuru. Justin Mason, who developed SpamAssassin, one of the most popular open-source anti-spam filters, says that Chung-Kwei looks promising.