10 Dec 2008

Temporary files: let the OS manage them

While developing several text categorization projects, I split my software into different programs, each with a concrete, non-overlapping purpose. The communication among the different (Java) programs was coordinated by Perl scripts, which made prototyping faster. The basic "scheme of communication" was the following: the output of the previous programs is used as input for the current one. And of course, if some input that was supposed to be there was not found, execution was aborted at that point.

This kind of "design" can lead to wrong computations if one program in the chain aborts and leaves a corrupted result file on the hard disk. Depending on how good you are at "error management" (one of the most forgotten parts in the world of software-development-cycle-for-publishing-before-that-bloody-deadline), this will require checking the integrity of the file beforehand. Some time ago, I decided to avoid these problems by using temporary files (read "temporary files" as a synonym for "handmade temporary files").

First of all, if my program had to generate, for instance, a file called "reuters.index", it first wrote a reuters.tmp and, once the whole procedure had finished without any problem, the same program renamed the file to its final name. This scheme is as dangerous as the previous one. Why? Because, again, your program can fail at an intermediate point; the error will abort the current execution (which is good) but could corrupt future ones (because there is already a temporary file on disk that your program could take as its own). Moreover, it rules out running several instances of your software in parallel (because all of them would write to the same temporary file, mixing their outputs). This can easily be avoided by adding a unique identifier to the file name (for example, the PID of the process). The PID (given by the "$$" variable in Perl) makes the temporary file names different and avoids the problems mentioned above.
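The handmade scheme described above can be sketched in Java as follows. This is only an illustration, not the original Perl code: the class and method names are hypothetical, and it assumes Java 11 or later (ProcessHandle needs Java 9, Path.of needs Java 11).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper illustrating the handmade PID-tagged temporary file.
final class PidTempFile {

    // Builds a sibling name such as "reuters_12345.tmp" for "reuters.index".
    static Path tempNameFor(Path finalFile) {
        String name = finalFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = (dot > 0) ? name.substring(0, dot) : name;
        long pid = ProcessHandle.current().pid(); // PID of this JVM process
        return finalFile.resolveSibling(base + "_" + pid + ".tmp");
    }

    // Writes to the PID-tagged file and renames it only after the write succeeded.
    static void writeThenRename(Path finalFile, String content) throws IOException {
        Path tmp = tempNameFor(finalFile);
        Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, finalFile, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Example content is made up for the demonstration.
        writeThenRename(Path.of("reuters.index"), "doc1\tearn\n");
    }
}
```

If the write fails, the exception propagates and the final file is never touched, but the PID-tagged file stays on disk, which is exactly the leftover problem discussed next.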

The only question left unanswered is "what happens to the temporary files of aborted executions?". After my program had failed several times, I could find 10 or 12 reuters_XXXX.tmp files in my working directory, each several hundred megabytes, filling it with nothing but trash. I found an "elegant" solution: at the beginning of my main script, I checked for existing temporary files and, if there were any, deleted them.
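A start-up cleanup of this kind might look like the sketch below. The original check lived in a Perl script; this Java version, its class name and the glob pattern "reuters_*.tmp" are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that removes leftovers from earlier aborted runs.
final class StaleTempCleaner {

    // Deletes every regular file in workDir matching the glob; returns how many were removed.
    static int deleteMatching(Path workDir, String glob) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> leftovers = Files.newDirectoryStream(workDir, glob)) {
            for (Path p : leftovers) {
                if (Files.isRegularFile(p) && Files.deleteIfExists(p)) {
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        int n = deleteMatching(Path.of("."), "reuters_*.tmp");
        System.out.println("Removed " + n + " leftover temporary file(s)");
    }
}
```

Note that a blanket cleanup like this also deletes the temporary files of any run that is still in progress, so it reintroduces the parallel-execution problem that the PID was meant to solve.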

If you have read this far and you are an experienced programmer, you may have thought that I am a novice. In fact, this kind of practice is reinventing the wheel. All modern operating systems (Linux, MacOS, Windows) support the creation and management of temporary files, avoiding the problems described above and probably managing them better than we would. Moreover, most modern languages (Java, Perl) include functions to use temporary files as if they were "normal files", with a single instruction.

For instance, in Java, we can write:

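import java.io.File;
// The prefix needs at least three characters; the suffix is used verbatim, so include the dot.
File f = File.createTempFile(tempFileName, ".tmp");
f.deleteOnExit(); // ask the JVM to delete the file on normal termination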

(Obviously, the second statement is not compulsory if we do not want to delete the file.) After that, we can dump our output to that file as usual, and finally rename it, or even open it to read it. With deleteOnExit(), the file is deleted when the program exits normally, and the unique name is generated for us as well.
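Putting the pieces together, the whole "write to a temporary file, then rename" cycle could be sketched as below. It swaps in java.nio.file.Files.createTempFile (instead of java.io.File) so that the temporary file can be created in the same directory as the final one; the class name, method name and sample data are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper: the final file only appears once it has been fully written.
final class TempThenPublish {

    static void publish(Path finalFile, List<String> lines) throws IOException {
        Path dir = finalFile.toAbsolutePath().getParent();
        // The library picks a unique name, e.g. "reuters<random digits>.tmp".
        Path tmp = Files.createTempFile(dir, "reuters", ".tmp");
        try {
            try (BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                for (String line : lines) {
                    out.write(line);
                    out.newLine();
                }
            }
            // Rename only after the writer has been closed without errors.
            Files.move(tmp, finalFile, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; removes the leftover if anything failed.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        publish(Path.of("reuters.index"), List.of("doc1\tearn", "doc2\tacq"));
    }
}
```

Creating the temporary file next to its destination keeps the final move a plain rename within one directory, and the finally block covers failures that do not kill the process outright.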

In Perl it is just as easy as in Java (the "UNLINK => 1" option may be used or not):

my ($fh, $filename) = tempfile(UNLINK => 1);

Then, you can use the file handle as usual:

print $fh "HOLA!\n";

And do not forget to load the corresponding module:

use File::Temp qw/ tempfile tempdir /;

The lesson learnt is: let the OS be the OS, and you be the scientist. You will write less code, and your probability of error will then be lower. Another way of looking at it could be "use all the power provided by your OS".
