11 Jun 2011

ReBayCT

[This post is about a piece of software I released some time ago]

ReBayCT ('Redes Bayesianas para Clasificación en Tesauros', Spanish for "Bayesian networks for classification from a thesaurus") is a console-based tool for running experiments in thesaurus-based indexing, that is, text categorization over the set of descriptors of a thesaurus. For more information on this problem, see [1]. It is written in Java (JDK 5.0 or higher required). The code of the project is located here, and it is free software (see below).

Several classifiers are implemented in this software: two baselines (VSM and hierarchical VSM) and one algorithm based on Bayesian networks, with versions for both unsupervised and supervised classification. If you use them, please consider citing [2] and [3].

[1] L. M. de Campos, J. M. Fernández-Luna, J. F. Huete, A. E. Romero, Thesaurus Based Automatic Indexing, book chapter in Handbook of Research on Text and Web Mining Technologies. Ed. Idea Group, Inc. USA, 2009, ISBN: 978-1-59904-990-8. Available online at http://www.cs.rhul.ac.uk/~aeromero/pdf/thesaurus.pdf.

[2] L. M. de Campos, A. E. Romero, Bayesian Network Models for Hierarchical Text Classification from a Thesaurus, Int. J. Approx. Reasoning 50(7): 932-944 (2009). Available online at http://www.cs.rhul.ac.uk/~aeromero/pdf/ijar09-thesaurus.pdf.

[3] L. M. de Campos, J. M. Fernández-Luna, J. F. Huete, A. E. Romero, Automatic Indexing from a Thesaurus Using Bayesian Networks: Application to the Classification of Parliamentary Initiatives. ECSQARU 2007: 865-877. In: Lecture Notes in Computer Science 4724 Springer 2007, ISBN 978-3-540-75255-4. Available online at http://www.cs.rhul.ac.uk/~aeromero/pdf/lncs07-ecsqaru-thesaurus.pdf.

The license of the software package is GNU GPL v3. Please check http://www.gnu.org/licenses/gpl.html for more details.

Note (a) to prospective users: I am not maintaining this software (except for small bug fixes) and I am no longer working on this research topic. So don't expect a new "major release", because it is never going to come out. If you have any questions about how to extend or use it, please ask by commenting on this post or by writing to my gmail account (alfonsoeromero). I'll be glad to answer and to help with your project. I must remind you that derivative works should also be free software, as required by the GPL (they should have a compatible license).

Note (b) to prospective users: to run this software you need a document collection and the EUROVOC (or another) thesaurus in XML. I cannot distribute EUROVOC, so please try to get a copy yourself (in XML). The dataset I used for the experiments in [2] (parliamentary initiatives of the Parliament of Andalusia) is not entirely public, and I prefer to keep some "control" over it, because I do not actually own the data and it is not clear whether I am allowed to distribute it, although it could be obtained by parsing the public documents on the Parliament of Andalusia webpage. In any case, if you need the set of documents, please ask me for it.
