Dec 6, 2008

Make your Lexicon array-free!

The following short note is a very programming-oriented one; I apologize to my more theoretically minded readers.

In many IR and TC applications, a common data structure for fast access to indexed documents is the Lexicon. A Lexicon is a set of terms (whatever a "term" may be) that are easily accessible either by identifier or by string. Access by identifier has the clear advantage that it can be done in constant time, while access by string has to visit a binary search tree (a red-black tree, perhaps), giving at best logarithmic-time access (with a certain overhead, because the string must be compared against the strings of all the nodes traversed). Given a class Term, a common way to implement our lexicon could be the following (the implementation is given in Java, but it could easily be translated to C++ or C#):

class Lexicon {
    private Map<String, Integer> identifierByString;
    private Term[] terms;

    ...

    public Term getTermById(int i) { return terms[i]; }

    public int size() { return terms.length; }
}

This implementation makes access to the i-th term easy (just terms[i]), given the identifier i of a term. It is, in my opinion, a mistake that takes flexibility away from your system. The three main drawbacks of the above design are the following:
  • It assumes that the system has "size()" terms, with identifiers going from 0 to size()-1.
  • It does not allow the removal of a term (for example, if we are doing term selection, a very common procedure in TC, we have to build a second lexicon and a "translation table", and translate the indexed documents to the new lexicon in order to avoid "unused" term identifiers).
  • The way to "iterate" over the whole set of terms is very weak and implementation-dependent:
for(int i = 0; i < lexicon.size(); i++)
{
    Term t = lexicon.getTermById(i);
    ...
}

Of course, the first assumption may simply not hold in some preprocessed datasets (the set of identifiers could start from 1, and it could have some "holes" in it due to a previous term selection).
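To make the problem concrete, here is a minimal sketch of what goes wrong with the array version once term selection leaves holes (the ids involved are made up for illustration):

    // Suppose a previous term selection removed the term with id 1,
    // so the valid ids are now {0, 2, 3}. The array version can only
    // mark the slot as null, and every client must know about it:
    Term t = lexicon.getTermById(1);   // returns null (or a stale Term)
    System.out.println(t.getString()); // NullPointerException, far from the cause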

I propose the following, more flexible approach:

class Lexicon {
    private Map<String, Integer> identifierByString;
    private HashMap<Integer, Term> terms;

    ...

    public Term getTermById(int i) { return terms.get(i); }

    public Set<Integer> getTermIdentifiers() { return terms.keySet(); }

    public int size() { return terms.size(); }
}

The array has been replaced with a map (a HashMap, if possible), which also gives us constant access time (with, of course, a bit of CPU overhead and additional memory usage).
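For completeness, here is a minimal sketch of how terms could be added under this design; the addTerm method and its id-assignment policy are my own assumptions, not part of the original class:

    // Hypothetical addTerm: reuses the id if the string is already
    // known, otherwise assigns the next free id. Assumes Term
    // exposes getString().
    private int nextId = 0;

    public int addTerm(Term term) {
        Integer existing = identifierByString.get(term.getString());
        if (existing != null) return existing; // already in the lexicon
        int id = nextId++;
        identifierByString.put(term.getString(), id);
        terms.put(id, term);
        return id;
    }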

Note the newly introduced method getTermIdentifiers(), whose purpose is to give us a substitute for the "weak" way of iterating over the lexicon. Now, by writing

for(int term : lexicon.getTermIdentifiers())
{
    Term t = lexicon.getTermById(term);
    ...
}

and using the Java 5 for-each loop, we can be sure we are not making any mistakes (like, for instance, using an invalid term id and causing the subsequent NullPointerException). Note that this version does not guarantee traversal of the set of terms in increasing id order. Never mind: getting terms by increasing id is something to be avoided anyway, because the id assigned to a term usually depends on the order in which the documents were processed, and is consequently arbitrary. Remember: ignore underlying structures and program to an interface.
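If a sorted traversal is ever needed (for reproducible output, say), it can be recovered on demand without changing the design; a minimal sketch, assuming nothing beyond the methods above:

    // Sort a copy of the id set on the fly (java.util.TreeSet);
    // the Lexicon itself remains order-agnostic.
    for (int id : new TreeSet<Integer>(lexicon.getTermIdentifiers())) {
        Term t = lexicon.getTermById(id);
        ...
    }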

Now we can write a simple removeTerm method in just two lines (remember to be consistent and remove the term from both maps!):

void removeTerm(int i)
{
    identifierByString.remove(terms.get(i).getString());
    terms.remove(i);
}
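As a usage example, term selection now becomes a direct loop over the lexicon itself. Note that we must iterate over a copy of the id set, because removing entries while iterating the live keySet() would throw a ConcurrentModificationException; the selectedIds set below is a hypothetical output of some scoring step, not part of the original design:

    // Hypothetical term selection: keep only the ids chosen by some
    // scoring function. We copy the ids first (java.util.HashSet)
    // because removeTerm modifies the underlying map.
    Set<Integer> ids = new HashSet<Integer>(lexicon.getTermIdentifiers());
    for (int id : ids) {
        if (!selectedIds.contains(id)) {
            lexicon.removeTerm(id);
        }
    }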

As a conclusion: sometimes it is better to sacrifice time and memory if the resulting design is stronger and does not require deep knowledge of the underlying structure. In my opinion, arrays should be avoided unless you are absolutely sure that your data is a mapping from 0..n-1 to a certain set. A good starting strategy, as a rule of thumb, could be the following: "if the contents are likely to change, be array-free; encapsulate containers, not arrays".
