6 May 2010

Manuscript of the thesis

I have uploaded the manuscript (and the slides) of my thesis, entitled Document Classification Models Based on Bayesian Networks, to the "publications" section of my webpage, so feel free to have a look at it. All comments are welcome.

29 Apr 2010

PhD Defended


[Photo: "Defending the Thesis", originally uploaded by AlfonsoERomero]
At last! And with "cum laude" (the maximum mark) as the result. Now it is time to look for a postdoc.

5 Apr 2010

Looking for a postdoc

On April 20* I will defend my PhD thesis, entitled "Document Classification Models based on Bayesian Networks". It has been a long way, and if everything goes as planned, I will be a doctor by the end of this month.

In order to broaden my scientific interests, I have decided to go for a postdoc in Europe. My current research interests are Document Classification, Information Retrieval and Bayesian networks. That said, I am interested in all approaches and applications of Machine Learning (not necessarily involving documents), and in fact I am open to any research topic with strong theoretical support.

I have a degree (Ingeniería, comparable to a BEng + MSc) in Computer Science, I will soon have a PhD in Computer Science, and I think I have a decent list of publications (here are some of them).

So, if you hear of an interesting offer (or perhaps can make me one), I will be grateful to hear about it. My email: alfonsoeromero (AT) gmail (DOT) com.
_____________________
* In the end, I had some problems due to the famous volcano, and the defense took place on April 27.

20 Jan 2009

Today is a good day to start learning Python

I was just trying to solve a problem for my brother, and I found Python really exciting. Of course, it does not seem as good as Perl for regular expressions, but on the other hand its modularization is very clean, and it also ships with a very large set of libraries.

Of course, I am not planning to move to Python, but I will keep an eye on the language. It could be a nice choice for building a fast prototype :).

3 Jan 2009

Liblinear is amazingly fast!

I decided to stop using LibSVM for the linear case because it is not optimized for it. Then I had a look at liblinear, developed by the same team as LibSVM. Liblinear is recommended for document classification because it removes many operations that are useless in the linear case. It also has a (very recent) port to Java, which is located here.

Now, working with 50 categories and 0.5 GB of data takes less than 10 seconds on a 2 GHz Core 2 Duo laptop. Those timings are impressive! The interface of this library is very similar to LibSVM's, so it is very easy to migrate from one library to the other. Of course, if your design is good, all you have to do is update or swap the corresponding facade class.
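That last point can be sketched with a tiny facade. The interface and classes below are my own illustration (with a toy learner standing in for a real one), not the actual API of LibSVM or liblinear:

```java
// Hypothetical facade: client code depends on this interface,
// not on a concrete SVM library.
interface LinearClassifier {
    void train(double[][] features, int[] labels);
    int predict(double[] features);
}

// Swapping LibSVM for liblinear would then mean writing a new
// implementation of the interface; callers do not change.
// This toy stand-in just remembers the most frequent training label.
class MajorityClassifier implements LinearClassifier {
    private int majority;

    public void train(double[][] features, int[] labels) {
        java.util.Map<Integer, Integer> counts = new java.util.HashMap<>();
        for (int y : labels) counts.merge(y, 1, Integer::sum);
        majority = counts.entrySet().stream()
                .max(java.util.Map.Entry.comparingByValue()).get().getKey();
    }

    public int predict(double[] features) {
        return majority;
    }
}
```

A real implementation would wrap the library's training and prediction calls behind the same two methods, so the rest of the pipeline never sees which library is underneath.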

BTW: Happy new year!

25 Dec 2008

Managing your informative networks

One of the main sources of "fresh information" in science is email; more precisely, mailing lists. Depending on your area of expertise, you can find several mailing lists where it is possible to look for:
  • Interesting scientific discussions (like "where can I find the first reference for algorithm X?", "is there a better way to do this?", "what is the best source on this topic?").
  • Postdoctoral positions: jobs you can take if you already have a PhD and want to do new and exciting research for a short period (2 or 3 years).
  • Special issues of scientific journals: maybe the opportunity to publish a paper in a highly recognized journal in your area or, at least, to get good feedback on your research.
  • Conferences: the main way of hearing about new or classic conferences, especially their important dates and lists of topics.
As a researcher between the fields of Information Retrieval and Machine Learning, I used to read the webir list to keep up with the latest news in the IR field. Unfortunately, Einat Amitay stopped managing the list after 10 years at the helm (the list is now closed). Recently, following the advice of my supervisor, I joined the ML-news list, which seems to be a very active forum in the area of Machine Learning, but I am still looking for a substitute for webir... Any good suggestions for a mailing list in the field of IR?

On the other hand, what do you think of mailing lists? Which lists do you belong to? Do you think email is very 90s? Do you trust Facebook groups more? Are you running a Machine Learning Twitter account? Feel free to answer, please.

10 Dec 2008

Temporary files: let the OS manage them

During the development of several pieces of work in text categorization, I split my software into different programs, each with a concrete, non-overlapping purpose. The communication among the different (Java) programs was coordinated by Perl scripts, which made my prototyping faster. The basic scheme of communication was the following: the output of each program, together with the outputs of the previous ones, is used as input for the next program. And of course, if some input that is supposed to be there is not found, execution aborts at that point.

This kind of "design" can lead to wrong computations if one program in the chain aborts, leaving a corrupted result file on the hard disk. Depending on how good you are at error management (one of the most-forgotten parts of the publish-before-that-bloody-deadline software cycle), this requires checking the integrity of the file beforehand. Some time ago, I decided to avoid these problems by using temporary files (understand "temporary files" here as a synonym of "handmade temporary files").

First of all, if my program had to generate, say, a file called "reuters.index", it first wrote a reuters.tmp, and once the whole procedure had finished without problems, the same program renamed the file to its final name. This scheme is as dangerous as the previous one. Why? Because, again, your program can fail at an intermediate point, and the error will abort the current execution (which is good), but it could corrupt future executions, because there is already a temporary file on disk that your program could mistake for its own. Moreover, it rules out running several parallel executions of your software, because they would all write to the same temporary file, mixing their outputs. This is easily avoided by adding a unique identifier to the file name, such as the PID of the process (which, in Perl, is given by the "$$" variable). The PID makes each temporary file name different and avoids the problems mentioned above.
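A minimal Java sketch of the same PID-in-the-name trick (the base name reuters is just an example; ProcessHandle requires Java 9+):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class PidTempFile {
    // Build a per-process temporary name such as "reuters_12345.tmp"
    static String tempName(String base) {
        return base + "_" + ProcessHandle.current().pid() + ".tmp";
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Paths.get(tempName("reuters"));
        Files.writeString(tmp, "partial results\n"); // work in progress goes here
        // Only when everything has finished correctly do we rename to the final name:
        Path result = Paths.get("reuters.index");
        Files.move(tmp, result, StandardCopyOption.REPLACE_EXISTING);
        Files.delete(result); // clean up after the demo
    }
}
```

Two parallel runs now get two different temporary names, so they no longer trample each other's output.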

The only question still left unanswered is what happens to the temporary files of aborted executions. After my program had failed several times, I would find 10 or 12 reuters_XXXX.tmp files in my working directory, each several hundred megabytes, filling it with nothing but trash. I found an "elegant" solution: at the beginning of my main script, I checked for existing temporary files and, if any were there, deleted them.

If you have read this far and you are an experienced programmer, you may well have concluded that I am a novice. Indeed, this kind of practice is reinventing the wheel. All modern operating systems (Linux, macOS, Windows) support the creation and management of temporary files, avoiding the problems described above, and probably managing them better than we would. Moreover, most modern languages (Java, Perl) include functions to use temporary files as if they were normal files, in a single instruction.

For instance, in Java we can write:

File f = File.createTempFile("reuters", ".tmp"); // prefix, suffix; a unique name is generated
f.deleteOnExit();

(Obviously, the second statement can be omitted if we do not want the file deleted.) After that, we can dump our output to the file as usual, and finally rename it, or even open it for reading. The file will be deleted when the program exits (that is what deleteOnExit asks for), and the management of unique names is also taken care of for us.
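Putting it together, a self-contained round trip might look like this (the file names and contents are illustrative):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class TempFileDemo {
    public static void main(String[] args) throws IOException {
        // The platform picks a unique name in the default temp directory
        File f = File.createTempFile("reuters", ".tmp");
        Files.writeString(f.toPath(), "term1 42\nterm2 17\n"); // dump the output as usual
        // The whole computation succeeded: move the result to its final name
        File result = new File("reuters.index");
        Files.move(f.toPath(), result.toPath(), StandardCopyOption.REPLACE_EXISTING);
        result.delete(); // tidy up after the demo
    }
}
```

If the program dies before the move, no half-written "reuters.index" is ever visible to the next stage of the pipeline.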

On the other hand, in Perl it is as easy as in Java (where the "UNLINK => 1" is optional):

my ($fh, $filename) = tempfile(UNLINK => 1);

Then you can use the file handle as usual:

print $fh "HOLA!\n";

And do not forget to include the corresponding package!:

use File::Temp qw/ tempfile tempdir /;

The lesson learned is: let the OS be the OS, and you be the scientist. You will write less code, and your probability of error will be lower. Another way of looking at it: use all the power your OS provides.