4 Dec 2008

Novice problems with LibSVM

If you are dealing with LibSVM, you must remember the following:
  • When building sparse vectors with the svm_node datatype, make sure the keys (indices) are allocated in ascending order. This is clearly stated in the documentation, but sometimes we are too lazy to read it first.
  • By default, the output of LibSVM when doing classification is one of {-1,1}. So do not expect real-valued outputs (for instance, the distance to the hyperplane) unless you hack the code yourself. If you are doing text categorization, this is fine for measuring (macro/micro) F1, but not for getting a good accuracy.
  • You must preprocess your feature vectors first! Joachims proposes tf * idf weighting followed by L2 normalization (the classical Euclidean norm). This works for text classification and maps every coordinate value to the interval [0,1]. Other normalization schemes suit "classic" classification problems such as iris (in those cases, the different attributes are scaled independently to [0,1]).
  • The Java version has a nasty bug (or missing feature?) in the method "svm_save_model": the output is not buffered, which makes saving a model very slow. To fix it, find this line:
    DataOutputStream fp = new DataOutputStream(new FileOutputStream(model_file_name));
    And replace it with the following:
    DataOutputStream fp = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(model_file_name)));
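
The first and third points can be sketched together in a few lines of Java. This is only an illustration under some assumptions: the class Node is a stand-in for libsvm's svm_node (which has the same two public fields, index and value) so that the snippet compiles without the LibSVM jar, the helper toSortedUnitVector is a made-up name, and the tf * idf weights are assumed to be already computed and non-negative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class SparseVectorSketch {

    /** Stand-in for libsvm.svm_node, which exposes the same two public fields. */
    static class Node {
        int index;
        double value;
    }

    /**
     * Turns an unordered map (feature index -> weight) into an array of nodes
     * sorted by ascending index and scaled to unit Euclidean (L2) length.
     */
    static Node[] toSortedUnitVector(Map<Integer, Double> weights) {
        // A TreeMap iterates over its keys in ascending order.
        TreeMap<Integer, Double> sorted = new TreeMap<>(weights);

        // Euclidean norm of the whole vector.
        double sumOfSquares = 0.0;
        for (double w : sorted.values()) {
            sumOfSquares += w * w;
        }
        double norm = Math.sqrt(sumOfSquares);

        Node[] nodes = new Node[sorted.size()];
        int i = 0;
        for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
            Node n = new Node();
            n.index = e.getKey();
            // Guard against an all-zero vector to avoid dividing by zero.
            n.value = norm > 0.0 ? e.getValue() / norm : 0.0;
            nodes[i++] = n;
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Hypothetical tf * idf weights, inserted in no particular order.
        Map<Integer, Double> tfidf = new HashMap<>();
        tfidf.put(7, 4.0);
        tfidf.put(2, 3.0);

        for (Node n : toSortedUnitVector(tfidf)) {
            System.out.println(n.index + ":" + n.value);
        }
    }
}
```

With the real library you would fill svm_node objects in the same way; the point is simply that sorting happens before the array is handed to LibSVM, and that with non-negative weights every normalized value ends up in [0,1].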

3 comments:

Alejandro Bellogín said...
This comment has been removed by the author.
Alejandro Bellogín said...

Just a note (a missing new!):
DataOutputStream fp = new DataOutputStream(new BufferedOutputStream(
new FileOutputStream(model_file_name)));

Bye!

PS: this blog is a very good idea! don't give up!

Alfonso E. said...

Yeah, you're right. I skipped a "new"; it should be correct now. Obviously the code was right ;).

PS.: Thanks for your comments and for following me :)