If you are dealing with LibSVM, you must remember the following:
- When building sparse vectors with the svm_node data type, make sure the keys (feature indices) are assigned in ascending order. This is clearly stated in the documentation, but sometimes we are too lazy to read it first.
- By default, when doing classification, LibSVM outputs one of {-1,1}. So do not expect real-valued outputs (for instance, the distance to the hyperplane) unless you hack the code yourself. If you are doing text categorization, this is fine for measuring (macro/micro) F1, but not for getting a good accuracy.
- You must preprocess your feature vectors first! Joachims proposes using tf * idf weights followed by L2 normalization (the classical Euclidean norm). This is valid for text classification and brings every coordinate value into the interval [0,1]. Other normalization schemes are valid for "classic" classification problems such as iris (in those cases, the different attributes are scaled independently to [0,1]).
- There is a nasty bug (a missing feature?) in the Java version, in the method "svm_save_model", that makes saving the model very slow because the output is not buffered. To fix it, find this line:
DataOutputStream fp = new DataOutputStream(new FileOutputStream(model_file_name));
And replace it with the following:
DataOutputStream fp = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(model_file_name)));
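The first and third points above (ascending keys and L2 normalization) can be sketched roughly as follows. This is only an illustration under some assumptions: Node is a local stand-in with the same two fields as LibSVM's svm_node (index and value), so the snippet compiles without the LibSVM jar, and the names SparseVectorSketch, toSortedNodes and l2Normalize are made up for this example. The weights are assumed to be already computed (for instance, tf * idf values).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: not part of LibSVM. Node mimics svm_node's two fields.
final class SparseVectorSketch {

    static final class Node {
        int index;      // feature key
        double value;   // feature weight
        Node(int index, double value) {
            this.index = index;
            this.value = value;
        }
    }

    // Builds a sparse vector whose indices are in ascending order,
    // whatever order the features were collected in.
    static Node[] toSortedNodes(Map<Integer, Double> weights) {
        // A TreeMap iterates its entries by ascending key.
        TreeMap<Integer, Double> sorted = new TreeMap<>(weights);
        Node[] nodes = new Node[sorted.size()];
        int i = 0;
        for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
            nodes[i++] = new Node(e.getKey(), e.getValue());
        }
        return nodes;
    }

    // Scales the vector in place to unit Euclidean (L2) length.
    // An all-zero vector is left unchanged.
    static void l2Normalize(Node[] nodes) {
        double sumSq = 0.0;
        for (Node n : nodes) {
            sumSq += n.value * n.value;
        }
        if (sumSq == 0.0) {
            return;
        }
        double norm = Math.sqrt(sumSq);
        for (Node n : nodes) {
            n.value /= norm;
        }
    }

    public static void main(String[] args) {
        // Hypothetical weights, inserted out of order on purpose.
        Map<Integer, Double> weights = new HashMap<>();
        weights.put(7, 4.0);
        weights.put(2, 3.0);

        Node[] x = toSortedNodes(weights);
        l2Normalize(x);
        for (Node n : x) {
            System.out.println(n.index + ":" + n.value);
        }
    }
}
```

With non-negative weights such as tf * idf, every coordinate of the normalized vector ends up in [0,1]. With the real library, the same loop would fill svm_node objects instead of the stand-in Node.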
3 comments:
Just a note (a missing new!):
DataOutputStream fp = new DataOutputStream(new BufferedOutputStream(
new FileOutputStream(model_file_name)));
Bye!
PS: this blog is a very good idea! don't give up!
Yeah, you're right. I skipped a "new"; it should be correct now. Obviously the code was right ;).
PS.: Thanks for your comments and for following me :)