25 Dec 2008

Managing your information networks

One of the main sources of "fresh information" in science is email. More precisely, mailing lists. Depending on your area of expertise, you can find several mailing lists where it is possible to search for:
  • Interesting scientific discussions (like: where can I find the first reference for algorithm X? Is there a better way to do this? What is the best source on this topic?).
  • Postdoctoral positions: jobs that you can take if you already have a PhD and you want to make a lot of money, I mean, do new and exciting research for a short period (2 or 3 years).
  • Special issues of scientific journals: maybe the opportunity to publish a paper in a highly recognized journal in your area, or at least to get good feedback on your research.
  • Conferences: the main source of information about new or classic conferences, especially the important dates and the list of interesting topics.
As a researcher working between the fields of Information Retrieval and Machine Learning, I used to read the webir list to keep up with the latest news in the IR field. Unfortunately, Einat Amitay stopped managing the list after 10 years (the list is now closed). Recently, following the advice of my supervisor, I joined the ML-news list, which seems to be a very active forum in the area of Machine Learning, but I am still looking for a substitute for webir... Any good suggestions for a mailing list in the field of IR?

On the other hand, what do you think of mailing lists? Which lists do you belong to? Do you think email is very 90s-ish? Do you trust Facebook groups more? Are you running a Machine Learning Twitter account? Feel free to answer, please.

10 Dec 2008

Temporary files: let the OS manage them

During the development of several pieces of work in text categorization, I split my software into different programs, each of them with a concrete and non-overlapping purpose. The communication among the different (Java) programs was coordinated by several Perl scripts, which made my prototyping faster. The basic "scheme of communication" was then the following: the output of a program, together with that of the previous ones, is used as input for the next program. And of course, if some input that was supposed to be there is not found, the execution aborts at that point.

This kind of "design" can lead us to wrong computations if a certain program in the chain aborts its execution, leaving a corrupted result file on our hard disk. Depending on how good you are at "error management" (one of the most forgotten parts in the world of software-cycle-development-for-publishing-before-that-bloody-deadline), this will require checking the integrity of the file beforehand. Some time ago, I decided to avoid these problems by using temporary files (let us understand "temporary files" here as a synonym of "handmade temporary files").

First of all, if my program had to generate, for instance, a file called "reuters.index", I first wrote a reuters.tmp, and after the whole procedure had finished without any problem, the same program renamed the file to its final name. This scheme is as dangerous as the previous solution. Why? Because, again, your program can fail at an intermediate point, and the error will abort the current execution (which is good), but it could corrupt future executions (because there is already a temporary file on the disk that your program could mistake for its own). Moreover, this removes any possibility of running several parallel executions of your software (because all of them would create the same temporary file, ending up with a combination of the outputs). This can be easily avoided if you add a unique identifier as part of the file name (like, for example, the PID of the process). This PID (which in Perl is given by the "$$" variable) makes the names of the temporary files different and avoids the previously mentioned problems.
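A minimal sketch of that handmade scheme in Java could look like the following (getting the PID is not portable in Java 5/6; the "pid@hostname" format of the runtime name is a common convention on most JVMs, not a guarantee, so take that part as an assumption):

import java.io.File;
import java.lang.management.ManagementFactory;

public class HandmadeTempFile {
    public static void main(String[] args) {
        // On most JVMs the runtime name looks like "6387@myhost"; this format is not specified anywhere
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        File tmp = new File("reuters_" + pid + ".tmp");
        System.out.println("Writing intermediate results to " + tmp.getName());
        // ... dump the output here, and rename to reuters.index only if everything went well
    }
}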

The only question still left unanswered is: what happens with the temporary files of different aborted executions? After my program had failed several times, I could find 10 or 12 reuters_XXXX.tmp files in my working directory which were, in fact, several hundred megabytes each, filling my working directory with nothing but trash. I found an "elegant" solution: at the beginning of my main script, I checked for the existence of temporary files, and if any were there, I deleted them.

If you have read up to this point and you are an experienced programmer, you may have thought that I am a novice. In fact, this kind of practice is reinventing the wheel. All modern operating systems (Linux, MacOS, Windows) support the creation and management of temporary files, avoiding the problems mentioned before, and probably managing them better than we would. Moreover, most modern languages (Java, Perl) include functions to use temporary files as if they were "normal files", with only one instruction.

For instance, in Java, we can write:

File f = File.createTempFile("reuters", ".tmp"); // e.g. reuters6387.tmp in the system temp directory
f.deleteOnExit();                                // the file will be removed automatically when the JVM exits

(Obviously the second statement can be omitted if we do not want the file to be deleted.) After that, we can dump our output to that file as usual, and finally rename it, or even open it to read it back. The file will be deleted on exit (by the JVM in this case), and the name management is also taken care of for us.
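Putting the pieces together, a minimal sketch of the whole write-then-rename idea (the file name reuters.index and the dummy contents are just for illustration) could be:

import java.io.*;

public class TempFileDemo {
    public static void main(String[] args) throws IOException {
        // The runtime picks a unique name such as reuters6387.tmp in the system temp directory
        File tmp = File.createTempFile("reuters", ".tmp");
        tmp.deleteOnExit();  // safety net in case we abort before the final rename

        // Dump the output as usual (the contents here are just dummy lines)
        PrintWriter out = new PrintWriter(new FileWriter(tmp));
        out.println("term1 1 2 3");
        out.println("term2 4 5");
        out.close();

        // Only if everything finished correctly, give the file its final name
        boolean renamed = tmp.renameTo(new File("reuters.index"));
        System.out.println(renamed ? "reuters.index written" : "rename failed");
    }
}

Note that renameTo can fail if the temporary directory and the destination are on different filesystems, so it is worth checking its return value.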

On the other hand, in Perl it is as easy as in Java (where the "UNLINK => 1" part is optional):

my ($fh, $filename) = tempfile(UNLINK => 1);

Then, you can use the file handle as usual:

print $fh "HOLA!\n";

And do not forget to include the corresponding module!

use File::Temp qw/ tempfile tempdir /;

The lesson learnt is: let the OS be the OS, and you be the scientist. You will write less code, and your probability of error will therefore be lower. Another way of looking at it could be: "use all the power provided by your OS".

6 Dec 2008

Make your Lexicon array-free!

The following short note is a very programming-oriented one. I apologize to my more theoretical readers.

In many IR and TC applications, a common data structure for fast access to indexed documents is the Lexicon. A Lexicon is a set of terms (whatever a "term" may be) which are easily accessible by identifier or by string. Access by identifier has the clear advantage that it can be done in constant time, while access by string has to visit a binary (red-black?) tree, giving, in the best case, logarithmic-time access (with a certain overhead, because the string has to be compared against the strings of all the nodes traversed). Given a class Term, a common way to implement our lexicon could be the following (the implementation is given in Java, but it could easily be translated to C++ or C#):

class Lexicon {
    private Map<String, Integer> identifierByString;
    private Term[] terms;

    // ...

    public Term getTermById(int i) { return terms[i]; }

    public int size() { return terms.length; }
}

This implementation, given the identifier i of a term, makes access to the i-th term easy (just by doing terms[i]). It is, in my opinion, an error that subtracts flexibility from your system. The three main drawbacks of the above design are the following:
  • It assumes that the system has "size()" terms, with identifiers going from 0 to size()-1.
  • It does not allow the removal of a term (for example, if we are doing term selection, a very common procedure in TC, we have to build a second lexicon, a "translation table", and translate the indexed documents to the new lexicon, in order to avoid "unused" term identifiers).
  • The way to "iterate" over the whole set of terms is very weak and implementation-dependent:
for (int i = 0; i < lexicon.size(); i++) {
    Term t = lexicon.getTermById(i);
    // ...
}

Of course, the first assumption may not hold for some preprocessed datasets (the set of term identifiers could start at 1, and it could have some "holes" in it, due to a previous term selection).

I propose the following, more flexible approach:

class Lexicon {
    private Map<String, Integer> identifierByString;
    private HashMap<Integer, Term> terms;

    // ...

    public Term getTermById(int i) { return terms.get(i); }

    public Set<Integer> getTermIdentifiers() { return terms.keySet(); }

    public int size() { return terms.size(); }
}

The array has been replaced by a map (a HashMap, if possible), which also gives us constant access time (with, of course, a bit of CPU overhead and some additional memory usage).

Note the newly introduced method getTermIdentifiers(), whose purpose is to give us a substitute for the "weak" way of iterating over the lexicon. Now, by writing

for (int term : lexicon.getTermIdentifiers()) {
    Term t = lexicon.getTermById(term);
    // ...
}

using the Java 5 foreach loop, we are sure we are not making any mistake (like, for instance, using an invalid term id, causing the subsequent NullPointerException). Note that this version does not guarantee traversal of the set of terms in increasing id order. Never mind: fetching terms strictly by increasing id should also be avoided (because the id assigned to a term is often based on the order in which documents were processed, and is consequently arbitrary). Remember: ignore underlying structures and program to an interface.

Now, we can write a simple "removeTerm" method in just two lines (remember to be consistent and remove the term from both maps!):

void removeTerm(int i) {
    identifierByString.remove(terms.get(i).getString());
    terms.remove(i);
}
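As an illustration of the flexibility we gain, here is a minimal sketch of in-place term selection, with no second lexicon and no translation table (selectionScore() and THRESHOLD are hypothetical names of mine, not part of the Lexicon class, and removeTerm is assumed to be accessible from the calling code):

// Copy the ids first (java.util.List / ArrayList assumed to be imported), because
// we cannot remove entries from the map while iterating over its keySet().
List<Integer> ids = new ArrayList<Integer>(lexicon.getTermIdentifiers());
for (int id : ids) {
    Term t = lexicon.getTermById(id);
    if (selectionScore(t) < THRESHOLD) {   // hypothetical scoring function and threshold
        lexicon.removeTerm(id);
    }
}

The surviving terms keep their original identifiers, holes included, and the indexed documents do not need to be translated.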

As a conclusion, sometimes it is better to sacrifice some time and memory if your design becomes stronger and does not require deep knowledge of the underlying structure. In my opinion, arrays should always be avoided unless you are completely sure (100%) that your data is a mapping from 0..n-1 onto a certain set. A good starting strategy, as a "rule of thumb", can be the following: "if the contents are likely to change, be array-free; encapsulate containers, not arrays".

5 Dec 2008

LibSVM integrated

I finally finished the integration of LibSVM into my software. Trying to reproduce Joachims' results on reuters and ohsumed-23, I got the following micro-averaged breakeven points:
  • On reuters: 84.2 (Joachims), 85.9 (me).
  • On ohsumed: 60.7 (Joachims), 64.8 (me).
(the reference paper is this).
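For readers unfamiliar with the measure: the micro-averaged breakeven point is the value at which micro-averaged precision equals micro-averaged recall, with both computed from contingency tables summed over all categories. A minimal sketch of the micro-averaging part, with made-up counts:

// Micro-averaged precision and recall: sum TP, FP, FN over all categories first.
// The breakeven point is the value at which the two quantities meet.
int[] tp = {120, 80, 30};   // made-up per-category counts
int[] fp = { 10, 20,  5};
int[] fn = { 15, 10, 10};

int sumTp = 0, sumFp = 0, sumFn = 0;
for (int c = 0; c < tp.length; c++) {
    sumTp += tp[c];
    sumFp += fp[c];
    sumFn += fn[c];
}
double microPrecision = (double) sumTp / (sumTp + sumFp);
double microRecall    = (double) sumTp / (sumTp + sumFn);
System.out.printf("micro-P = %.3f, micro-R = %.3f%n", microPrecision, microRecall);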

The differences may be due to the difference in the stopword list (I used the famous 571 words of the SMART system, which is almost a standard) and to my own preprocessing procedure (I remove all punctuation marks). The results are indeed really good, but the large difference on ohsumed is mysterious...

By the way, on my Core 2 Duo 2 GHz, training time for LibSVM is 4m28s, and classification takes 1m42s. It is the Java version, but it is still affordable. Who said SVMs were slow?

Over the following days, I will try to improve my k-NN implementation (at this moment it has no inverted index, and so it is terrifyingly slow), and to include another Bayesian network classifier (Sahami's "limited dependence Bayesian classifier"), which I think could be improved in some way to make it competitive with SVMs.

4 Dec 2008

Novice problems with LibSVM

If you are dealing with LibSVM, you must remember the following:
  • When building sparse vectors using the svm_node datatype, be careful to assign the indices in ascending order. This is clearly specified in the documentation, but sometimes we are too lazy to read it first (see the sketch after this list).
  • By default, the outputs of LibSVM when doing classification are in {-1,1}. So do not expect to get real-valued outputs (for instance, the distance to the hyperplane) unless you hack the code yourself. If you are doing text categorization, this is good for measuring (macro/micro) F1, but not for getting a good accuracy.
  • You must preprocess your feature vectors first! Joachims proposes using tf * idf, followed by an L2 normalization (the classical Euclidean norm). This is valid for text classification, mapping every coordinate value to the interval [0,1]. Other normalization schemes are valid for "classic" classification problems like iris and so on (in those cases, the different attributes are scaled independently to [0,1]).
  • There is a nasty bug (or lack of a feature?) in the Java version, in the method "svm_save_model", that makes that procedure very slow, because the output is not buffered. To solve it, find this line:
    DataOutputStream fp = new DataOutputStream(new FileOutputStream(model_file_name));
    And change it to the following:
    DataOutputStream fp = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(model_file_name)));
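Regarding the first and third points, here is a minimal sketch of how a sparse, L2-normalized vector could be built (the feature ids and weights are made up; only the svm_node fields index and value come from LibSVM itself):

import libsvm.svm_node;

public class SparseVectorSketch {
    public static void main(String[] args) {
        int[] featureIds = {3, 10, 42};        // indices MUST already be in ascending order
        double[] weights = {1.2, 0.4, 2.5};    // e.g. raw tf * idf values for one document

        // L2 normalization (classical Euclidean norm)
        double norm = 0.0;
        for (double w : weights) norm += w * w;
        norm = Math.sqrt(norm);

        svm_node[] vector = new svm_node[featureIds.length];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = new svm_node();
            vector[i].index = featureIds[i];       // ascending order, as the documentation requires
            vector[i].value = weights[i] / norm;   // every coordinate now lies in [0, 1]
        }
    }
}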

3 Dec 2008

Using an SVM library for text categorization

In text categorization, one of my research fields, it is impossible to ignore the great power of Support Vector Machines (SVMs). Almost everyone agrees that, even in their most primitive form (linear SVMs), they are the killer algorithm for this task (the one that obtains the highest accuracy). And of course, that implies that any newly presented approach to this problem should be tested against SVMs.

But what (open source/free) software packages are available for Support Vector Machines? In fact, the list is very short, the two most popular ones being the following:
  • SVMlight, by Thorsten Joachims.
  • LibSVM, by Chih-Chung Chang and Chih-Jen Lin.
Both of them are suitable for text categorization tasks, as they are lightweight implementations and relatively fast. Besides, they both include Platt's SMO algorithm in order to make the training procedure faster. So, which one should be chosen?

I have tested both of them. In my most recent paper, we used SVMlight for a comparison against a Bayesian network model for classification in a thesaurus environment (and of course, we beat linear SVMs!). That package is amazingly fast, not only because of the language it is written in (C), but because of Joachims' great work on heuristics and other tricks.

At the moment, I am using LibSVM in my Java environment for text categorization (which I expect to release soon as free software), using the Java implementation directly. Although it is written in a very "C-style" way (arrays instead of containers, static methods, no exceptions, ...), it is not so bad speed-wise (obviously it is several times slower than SVMlight, but it is Java, which avoids linking with a non-portable library and keeps the entire system in one language).

From the point of view of software licenses, LibSVM is released under the modified BSD license (a GPL-compatible license). This is good, because it allows you to use the package even in a non-free software environment (I must admit this last point is not really so good). SVMlight, on the other hand, is not free software. The license note states that:

The program is free for scientific use. Please contact me, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author.

This is an important fact for me, and that is why I prefer LibSVM. I must admit that Joachims' work is impressive, and SVMlight is probably a faster and more complete environment, but I can cope with LibSVM's lack of functionality and speed, because it is free software.

What do you think about this?

2 Dec 2008

H-index and so on

The Hirsch index (or, for short, the h-index) is a way to measure scientific popularity with one number. A scientist with an h-index of n has n papers with at least n citations each. Note that I am saying it measures "scientific popularity", not "scientific excellence" or "scientific quality" (although there is often a correlation among these three things).

The h-index is sometimes a means of self-glorification, and other times it hides a collaborative mafia ("I will cite you if you cite me"), but I like it (and so does the scientific community). It is only a metric but, as with other metrics, it is a quantitative way to measure the importance of a scientist within his or her community.

As you can see, the h-index gets harder and harder to increase (an h-index of n requires at least n² citations in total): when you get your first citation, you get h=1. To get h=2, you need two papers with at least two citations each: either one more citation on the paper that already has one plus at least two on a different paper, or at least two citations on each of two other papers. That means that stepping from an h-index of n-2 to n-1 is quite a bit easier than doing the same from n-1 to n (because at every step you need more and more citations).
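As a small illustration of the definition, this is a minimal sketch (with made-up citation counts) of how the index can be computed once the per-paper counts are sorted in decreasing order:

// The h-index is the largest n such that n papers have at least n citations each.
int[] citations = {17, 9, 6, 3, 1, 0};   // made-up per-paper counts, sorted in decreasing order
int h = 0;
for (int i = 0; i < citations.length; i++) {
    if (citations[i] >= i + 1) {
        h = i + 1;   // the (i+1) most cited papers all have at least i+1 citations
    }
}
System.out.println("h-index = " + h);    // prints 3 for these counts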

A tool which is helpful for computing this index is "Publish or Perish", available here. Using Google Scholar and other similar services, this tool can compute the h-index of a researcher in a semi-automatic way (you often have to manually discard several publications that do not come from that author, and of course, self-citations).

By the way, as of today (2/12/2008) my h-index is 1. That is not so bad for a PhD student, but I hope it will improve next year...

What do I understand by "a professional blog"?

Hi everyone!

As a researcher in the field of Computer Science, I have decided to open a "professional" blog. What I mean by "professional" is that all the contents of this blog are going to be exclusively related to my research and my current job (I am a funded student at the University of Granada, Spain, in the Department of Computer Science and Artificial Intelligence). So, the posts of this blog may comprise:
  • Publications I have made, with some personal comments.
  • Conferences I have attended and/or spoken at (which could also include some interesting pics!).
  • Interesting papers I have read (also with personal comments).
  • Software that I have released, or software that I have found interesting or useful.
  • Random thoughts (i.e. opinions) on Computer Science, Information Retrieval, Machine Learning, or science in general.
  • Interesting links (coming from interesting sources).
  • Miscellaneous information (about this blog, or about me, always related to my job).
This is what I understand by a "professional" blog. Other suggestions are also welcome.