3 Dec 2008

Using an SVM library for text categorization

In text categorization, one of my research fields, it is impossible to ignore the great power of Support Vector Machines (SVMs). Almost everyone agrees that, even in their most basic form (the linear SVM), they are the killer algorithm for this task (the one that achieves the highest accuracy). And of course, that implies that any newly proposed approach to this problem should be tested against SVMs.

But what (open source/free) software packages are available for Support Vector Machines? In fact, the list is quite short, and the two most popular ones are SVMlight and LibSVM.

Both of them are suitable for text categorization tasks, as they are lightweight and relatively fast implementations. Besides, both rely on decomposition methods related to Platt's SMO algorithm to make training faster. So, which one should you choose?

I have tested both of them. In my most recent paper, we used SVMlight as the baseline against a Bayesian network model for classification in a thesaurus environment (and of course, we beat linear SVMs!). The package is amazingly fast, not only because of the language it is written in (C), but also thanks to the great job Joachims did with heuristics and other tricks.

At the moment, I am using LibSVM in my Java environment for text categorization (which I expect to release soon as free software), calling the Java implementation directly. Although it is written in a very "C-style" way (arrays instead of containers, static methods, no exceptions...), its speed is not so bad (obviously it is several times slower than SVMlight, but it is Java, which avoids linking against a non-portable library and keeps the entire system in one language).
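
As a side note, SVMlight and LibSVM both read training data in the same sparse text format: one document per line, a label followed by `index:value` pairs with ascending indices. Here is a minimal sketch of how a tokenized document could be turned into such a line in plain Java. The class name, the vocabulary map, the raw term-frequency weighting and the binary +1/-1 labels are all my own assumptions for illustration; none of them come from either package.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SparseLineDemo {

    // Builds one line of the sparse "label index:value ..." format.
    // Assumptions (hypothetical, for illustration only): binary labels,
    // a precomputed vocabulary mapping terms to 1-based feature indices,
    // and raw term counts as feature values.
    public static String toSparseLine(int label, List<String> tokens,
                                      Map<String, Integer> vocabulary) {
        // TreeMap keeps the feature indices sorted in ascending order.
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            Integer index = vocabulary.get(token);
            if (index != null) { // terms outside the vocabulary are skipped
                counts.merge(index, 1, Integer::sum);
            }
        }
        StringBuilder line = new StringBuilder(label > 0 ? "+1" : "-1");
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            line.append(' ').append(entry.getKey())
                .append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> vocabulary = new HashMap<>();
        vocabulary.put("svm", 1);
        vocabulary.put("kernel", 2);
        vocabulary.put("text", 3);

        List<String> tokens = Arrays.asList("text", "svm", "the", "text");
        // prints "+1 1:1 3:2"
        System.out.println(toSparseLine(+1, tokens, vocabulary));
    }
}
```

In a real system you would probably replace the raw counts with something like tf-idf weights and normalize each document vector, but the file layout stays the same.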

From the point of view of software licenses, LibSVM is released under the modified BSD license (a GPL-compatible license). This is good, because it allows you to use the package even in a non-free software environment (though I must admit that this last point is not really so good). SVMlight, on the other hand, is not free software. Its license note states:

The program is free for scientific use. Please contact me, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author.

This is an important point for me, and that is why I prefer LibSVM. I must admit that Joachims' work is impressive, and SVMlight is probably a faster and more complete environment, but I can cope with LibSVM's more limited functionality and speed, because it is free software.

What do you think about this?
