<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-986368889047854119</id><updated>2011-09-29T06:34:41.033+02:00</updated><category term='text classification'/><category term='postdoc'/><category term='research'/><category term='python'/><category term='software'/><category term='programming'/><category term='self.blog'/><category term='myself'/><category term='svm'/><category term='work'/><category term='papers'/><category term='science'/><category term='networks'/><title type='text'>Professional blog of Alfonso E. Romero</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>17</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-7136201408322834874</id><published>2011-06-11T21:45:00.000+02:00</published><updated>2011-06-11T21:45:31.710+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='research'/><category scheme='http://www.blogger.com/atom/ns#' term='software'/><title type='text'>Rebayct</title><content type='html'>&lt;div style="line-height: 1.25em; max-width: 64em;"&gt;&lt;div style="text-align: justify;"&gt;&lt;strong&gt;[This is a post about a piece of software I released some time ago]&lt;/strong&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;strong&gt;&lt;br /&gt;&lt;/strong&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;strong&gt;ReBayCT&lt;/strong&gt;&amp;nbsp;('Redes Bayesianas para Clasificación en Tesauros', Spanish for "Bayesian networks for classification from a Thesaurus") is a console-based tool for performing experiments in Thesaurus-based indexing, that is to say, Text Categorization over the set of descriptors of a thesaurus. For more information on this problem, see&amp;nbsp;[&lt;strong&gt;1]&lt;/strong&gt;. It's written in Java (JDK 5.0 or higher required). The project's code is located &lt;a href="http://code.google.com/p/rebayct/"&gt;here&lt;/a&gt;&amp;nbsp;and it is &lt;b&gt;free software&lt;/b&gt; (see below).&lt;/div&gt;&lt;/div&gt;&lt;div style="line-height: 1.25em; max-width: 64em;"&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="line-height: 1.25em; max-width: 64em;"&gt;&lt;div style="text-align: justify;"&gt;There are several classifiers implemented in this software. 
They comprise two &lt;i&gt;baselines&lt;/i&gt; (VSM and hierarchical VSM) and one algorithm based on Bayesian networks, with both unsupervised and supervised versions. If you use them, please consider citing&amp;nbsp;[&lt;strong&gt;2]&lt;/strong&gt;&amp;nbsp;and&amp;nbsp;[&lt;strong&gt;3].&lt;/strong&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="line-height: 1.25em; max-width: 64em;"&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="line-height: 1.25em; max-width: 64em;"&gt;&lt;div style="text-align: justify;"&gt;&lt;strong&gt;&lt;strong&gt;[1]&lt;/strong&gt;&amp;nbsp;&lt;/strong&gt;L. M. de Campos, J. M. Fernández-Luna, J. F. Huete, A. E. Romero,&amp;nbsp;&lt;i&gt;Thesaurus Based Automatic Indexing&lt;/i&gt;, book chapter in Handbook of Research on Text and Web Mining Technologies. Ed. Idea Group, Inc. USA, 2009, ISBN: 978-1-59904-990-8. Available online at&amp;nbsp;&lt;a href="http://www.cs.rhul.ac.uk/~aeromero/pdf/thesaurus.pdf" rel="nofollow" style="color: #0000cc;"&gt;http://www.cs.rhul.ac.uk/~aeromero/pdf/thesaurus.pdf&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;&lt;div style="line-height: 1.25em; max-width: 64em;"&gt;&lt;div style="text-align: justify;"&gt;&lt;strong&gt;&lt;strong&gt;&lt;br /&gt;&lt;/strong&gt;&lt;/strong&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;strong&gt;&lt;strong&gt;[2]&lt;/strong&gt;&amp;nbsp;&lt;/strong&gt;L. M. de Campos, A. E. Romero,&amp;nbsp;&lt;i&gt;Bayesian Network Models for Hierarchical Text Classification from a Thesaurus&lt;/i&gt;, Int. J. Approx. Reasoning 50(7): 932-944 (2009). 
Available online at&amp;nbsp;&lt;a href="http://www.cs.rhul.ac.uk/~aeromero/pdf/ijar09-thesaurus.pdf" rel="nofollow" style="color: #0000cc;"&gt;http://www.cs.rhul.ac.uk/~aeromero/pdf/ijar09-thesaurus.pdf&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;&lt;div style="line-height: 1.25em; max-width: 64em;"&gt;&lt;div style="text-align: justify;"&gt;&lt;strong&gt;&lt;strong&gt;&lt;br /&gt;&lt;/strong&gt;&lt;/strong&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;strong&gt;&lt;strong&gt;[3]&lt;/strong&gt;&amp;nbsp;&lt;/strong&gt;L. M. de Campos, J. M. Fernández-Luna, J. F. Huete, A. E. Romero,&amp;nbsp;&lt;i&gt;Automatic Indexing from a Thesaurus Using Bayesian Networks: Application to the Classification of Parliamentary Initiatives&lt;/i&gt;. ECSQARU 2007: 865-877. In: Lecture Notes in Computer Science 4724 Springer 2007, ISBN 978-3-540-75255-4. Available online at&amp;nbsp;&lt;a href="http://www.cs.rhul.ac.uk/~aeromero/pdf/lncs07-ecsqaru-thesaurus.pdf" rel="nofollow" style="color: #0000cc;"&gt;http://www.cs.rhul.ac.uk/~aeromero/pdf/lncs07-ecsqaru-thesaurus.pdf&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="text-align: justify;"&gt;The license of the software package is &lt;b&gt;GNU GPL v3&lt;/b&gt;. Please check&amp;nbsp;&lt;a href="http://www.gnu.org/licenses/gpl.html"&gt;http://www.gnu.org/licenses/gpl.html&lt;/a&gt;&amp;nbsp;for more details.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;Note (a) to possible users:&lt;/b&gt; I am not maintaining this software (except for small bugs) and I'm not working in this research topic now. So, don't wait for a new "major release", because it's never going to come out. 
If you have any questions about how to extend or use it, please ask by writing a comment on this post or by emailing my gmail account (&lt;i&gt;alfonsoeromero&lt;/i&gt;). I'll be glad to answer and to help with your project. I must remind you that &lt;b&gt;derivative works should also be free software&lt;/b&gt;, as specified by the GPL (they should have a compatible license).&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;Note (b) to possible users: &lt;/b&gt;to run this software you need a document collection and the &lt;i&gt;EUROVOC&lt;/i&gt; (or another) &lt;i&gt;thesaurus &lt;/i&gt;in XML. I cannot distribute &lt;a href="http://eurovoc.europa.eu/"&gt;EUROVOC&lt;/a&gt;, so please try to get a copy yourself (in XML). The dataset I used for experimentation in [2] is not entirely public (parliamentary initiatives of the Parliament of Andalusia), and I prefer to keep some "control" over it, because &lt;u&gt;I don't really own the data&lt;/u&gt; (and it's not very clear whether I am allowed to distribute it), although it could be obtained by parsing public documents available on the &lt;a href="http://www.parlamentodeandalucia.es/webdinamica/portal-web-parlamento/inicio.do"&gt;Parliament of Andalusia webpage&lt;/a&gt;. 
Anyway, if you need the set of documents, please ask me for them.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-7136201408322834874?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/7136201408322834874/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=7136201408322834874' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/7136201408322834874'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/7136201408322834874'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2011/06/rebayct.html' title='Rebayct'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total><georss:featurename>Staines, Surrey, United Kingdom</georss:featurename><georss:point>51.4350161 -0.508783200000039</georss:point><georss:box>51.3938166 -0.579615200000039 51.476215599999996 -0.43795120000003895</georss:box></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-6371270084599858030</id><published>2011-03-18T19:15:00.001+01:00</published><updated>2011-03-18T19:20:51.958+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='papers'/><title type='text'>New paper: "Image zooming based on sampling theorems"</title><content type='html'>&lt;div style="text-align: 
justify;"&gt;Together with my good friend, Prof. &lt;a href="http://www4.ujaen.es/~jmalmira/"&gt;José M. Almira&lt;/a&gt;, I have published in &lt;a href="http://mat.uab.es/~matmat/"&gt;&lt;b&gt;&lt;i&gt;Materials Matemàtics&lt;/i&gt;&lt;/b&gt;&lt;/a&gt; a paper entitled &lt;b&gt;&lt;i&gt;"Image zooming based on sampling theorems"&lt;/i&gt;&lt;/b&gt;,&amp;nbsp;which reviews some classic zooming methods (specifically 'sinc interpolation') used in the field of digital image processing. It is a review paper in which we have tried to compile the literature on this topic precisely and to give a formal notation, from a mathematical point of view, for the process of zooming an image.&amp;nbsp;As the &lt;b&gt;abstract &lt;/b&gt;says:&lt;/div&gt;&lt;blockquote style="text-align: justify;"&gt;&lt;i&gt;In this paper we introduce two digital zoom methods based on sampling theory and we study their mathematical foundation. The first one (usually known by the names of ‘sinc interpolation’, ‘zero-padding’ and ‘Fourier zoom’) is commonly used by the image processing community.&lt;/i&gt;&lt;/blockquote&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;The paper is online &lt;a href="http://mat.uab.es/~matmat/PDFv2011/v2011n01.pdf"&gt;here&lt;/a&gt;&lt;/b&gt;, as the journal is electronic, and can be read and downloaded free of charge. 
I highly encourage you to read it if you are interested in &lt;i&gt;how digital images can be zoomed&lt;/i&gt; in programs like &lt;a href="http://gimp.org/"&gt;The Gimp&lt;/a&gt; or Photoshop.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;I must thank the editors&lt;/b&gt; and add that the final version of the paper is impressive, thanks to the nice LaTeX style used in the journal and the careful editing they have done, polishing its content and adding some descriptive images to those we had already provided.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-6371270084599858030?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/6371270084599858030/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=6371270084599858030' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/6371270084599858030'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/6371270084599858030'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2011/03/new-paper-image-zooming-based-on.html' title='New paper: &quot;Image zooming based on sampling theorems&quot;'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-3631261906528526546</id><published>2011-01-01T00:43:00.000+01:00</published><updated>2011-01-01T00:43:10.028+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='myself'/><title type='text'>Happy 2011!</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;b&gt;2010 was&lt;/b&gt; &lt;b&gt;a great year for me&lt;/b&gt;, in professional terms. Mainly, I achieved &lt;b&gt;two important goals&lt;/b&gt; for my career:&lt;/div&gt;&lt;br /&gt;&lt;ol&gt;&lt;li style="text-align: justify;"&gt;In April, &lt;b&gt;I defended my &lt;a href="http://decsai.ugr.es/~aeromero/files/thesis.pdf"&gt;thesis&lt;/a&gt;&lt;/b&gt;, and therefore I got my PhD.&lt;/li&gt;&lt;li style="text-align: justify;"&gt;In September, I started a new job as a postdoc at the &lt;a href="http://cs.rhul.ac.uk/"&gt;Computer Science Department&lt;/a&gt; of &lt;a href="http://www.rhul.ac.uk/"&gt;Royal Holloway, University of London&lt;/a&gt;, under the supervision of &lt;a href="http://www.cs.rhul.ac.uk/~alberto/"&gt;Dr. Alberto Paccanaro&lt;/a&gt;.&lt;/li&gt;&lt;/ol&gt;&lt;div style="text-align: justify;"&gt;In addition, I released my &lt;b&gt;first contribution&lt;/b&gt; to the free software community, &lt;b&gt;&lt;a href="http://daurolab.blogspot.com/"&gt;DauroLab&lt;/a&gt;&lt;/b&gt;, a Java library for Large Scale Machine Learning (still not very mature). I plan to improve it monthly during this new year, with &lt;i&gt;clear and concise&lt;/i&gt; objectives.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Besides, in my new &lt;a href="http://paccanarolab.org/"&gt;research group&lt;/a&gt;, I started working on &lt;b&gt;bioinformatics&lt;/b&gt;. 
This is a new research area for me, full of promising and exciting problems, and although I am still learning the basics of this field of science, I will surely publish a paper on it very soon.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Also, 2010 was a great year for many other reasons. But the most important one is&amp;nbsp;&lt;b&gt;the people I met during it&lt;/b&gt;.&lt;b&gt;&amp;nbsp;&lt;/b&gt;Thank you everybody for your support, for your trust, and for being there in the &lt;i&gt;not-so-good&lt;/i&gt; moments.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-3631261906528526546?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/3631261906528526546/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=3631261906528526546' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/3631261906528526546'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/3631261906528526546'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2011/01/happy-2011.html' title='Happy 2011!'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-7661409476796340374</id><published>2010-09-09T15:56:00.001+02:00</published><updated>2011-01-29T16:01:21.956+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='postdoc'/><category scheme='http://www.blogger.com/atom/ns#' term='work'/><title type='text'>Got a Postdoc!</title><content type='html'>&lt;div style="text-align: justify;"&gt;Sorry for not updating for so long. Anyway, I have &lt;b&gt;some exciting news&lt;/b&gt;!&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;Since September 1 I have been a postdoctoral research assistant&lt;/b&gt; at the &lt;a href="http://paccanarolab.org/"&gt;Computational Biology group&lt;/a&gt; of the &lt;a href="http://www.cs.rhul.ac.uk/"&gt;Computer Science Department&lt;/a&gt; of &lt;a href="http://rhul.ac.uk/"&gt;Royal Holloway, University of London&lt;/a&gt;, under the supervision of &lt;a href="http://www.cs.rhul.ac.uk/home/alberto/"&gt;Alberto Paccanaro&lt;/a&gt;. Although it might sound really "biological", the group searches for solutions to biological problems using machine-learning based models. So, in the end, it is machine learning applied to &lt;i&gt;something&lt;/i&gt; (biology, in this case).&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The position is for &lt;b&gt;1.5 years&lt;/b&gt; (until the end of February 2012), and it is not renewable. For me it is a great opportunity to get started in the world of Computational Biology, in a high-level group. 
Also, I would like to keep studying Machine Learning, and to try to develop some work in a more "pure" and "formal" line (but this is not a real priority).&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The University seems to have a really nice working atmosphere, and all my colleagues are fantastic. Also, the department is small, but it has &lt;b&gt;great figures in Computer Science&lt;/b&gt; (Vladimir Vapnik since his retirement this year, Alexey Chervonenkis or Glenn Shafer, among others). Also, the fact that it is not situated in the center of London makes the campus a quiet and peaceful place (perfect for thinking and working).&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;I think it is going to be one of the most important periods of my research career (and probably my life).&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-7661409476796340374?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/7661409476796340374/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=7661409476796340374' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/7661409476796340374'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/7661409476796340374'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2010/09/got-postdoc.html' title='Got a Postdoc!'/><author><name>Alfonso 
E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-4422221848473000981</id><published>2010-05-06T01:22:00.000+02:00</published><updated>2010-05-06T01:22:39.927+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='research'/><title type='text'>Manuscript of the thesis</title><content type='html'>I have uploaded the manuscript (and the slides) of my thesis (entitled &lt;i&gt;Document Classification Models Based on Bayesian Networks&lt;/i&gt;) in the &lt;a href="http://decsai.ugr.es/~aeromero/doku.php?id=publications"&gt;"publications" section&lt;/a&gt; of my webpage (so, if you want to have a look at it, you can). All comments will be welcomed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-4422221848473000981?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/4422221848473000981/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=4422221848473000981' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/4422221848473000981'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/4422221848473000981'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2010/05/manuscript-of-thesis.html' title='Manuscript of the thesis'/><author><name>Alfonso 
E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-5077294036071852272</id><published>2010-04-29T00:45:00.001+02:00</published><updated>2010-04-29T00:45:09.516+02:00</updated><title type='text'>PhD Defended</title><content type='html'>&lt;div style="float: right; margin-left: 10px; margin-bottom: 10px;"&gt;&lt;a href="http://www.flickr.com/photos/alfonsoeromero/4561820996/" title="photo sharing"&gt;&lt;img src="http://farm4.static.flickr.com/3292/4561820996_178966f5de_m.jpg" alt="" style="border: solid 2px #000000;" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-size: 0.9em; margin-top: 0px;"&gt;&lt;a href="http://www.flickr.com/photos/alfonsoeromero/4561820996/"&gt;Defending the Thesis&lt;/a&gt;&lt;br /&gt;Originally uploaded by &lt;a href="http://www.flickr.com/people/alfonsoeromero/"&gt;AlfonsoERomero&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;At last! Also, with a "cum laude" (maximum mark) as the result. 
Now, it is time to look for a postdoc.&lt;br clear="all" /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-5077294036071852272?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/5077294036071852272/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=5077294036071852272' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/5077294036071852272'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/5077294036071852272'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2010/04/phd-defended.html' title='PhD Defended'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3292/4561820996_178966f5de_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-8766096306478688163</id><published>2010-04-05T18:01:00.005+02:00</published><updated>2010-04-29T00:39:20.849+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='postdoc'/><category scheme='http://www.blogger.com/atom/ns#' term='work'/><title type='text'>Looking for a postdoc</title><content type='html'>&lt;b&gt;On April &lt;s&gt;20&lt;/s&gt; 27*&lt;/b&gt; I will defend my PhD thesis entitled &lt;b&gt;"Document 
Classification Models based on Bayesian Networks"&lt;/b&gt;. It has been a long road, and if everything goes as planned, I will be a doctor by the end of this month.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In order to broaden my scientific interests, I have decided to go for &lt;b&gt;a postdoc&lt;/b&gt; in Europe. My current research interests are Document Classification, Information Retrieval and Bayesian networks. Anyway, I am interested in all approaches and applications in &lt;b&gt;Machine Learning&lt;/b&gt; (not necessarily on documents), and in fact I am open to any research topic with strong theoretical support. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I have a degree (&lt;i&gt;Ingeniería&lt;/i&gt;, roughly BEng + MSc) in Computer Science, and will soon have a PhD in Computer Science, and I think I have a decent list of publications (&lt;a href="http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/r/Romero:Alfonso_E=.html"&gt;here&lt;/a&gt; are some of them).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, if you hear of an interesting offer (or can perhaps make me one), I will be grateful to hear about it. 
My email: alfonsoeromero (AT) gmail (DOT) com.&lt;/div&gt;&lt;div&gt;_____________________&lt;/div&gt;&lt;div&gt;* Finally, I had some problems due to the famous &lt;a href="http://en.wikipedia.org/wiki/Eyjafjallaj%C3%B6kull"&gt;volcano&lt;/a&gt;, and the defense was &lt;b&gt;on April 27&lt;/b&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-8766096306478688163?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/8766096306478688163/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=8766096306478688163' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/8766096306478688163'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/8766096306478688163'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2010/04/looking-for-postdoc.html' title='Looking for a postdoc'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-2052280846121556802</id><published>2009-01-20T18:40:00.003+01:00</published><updated>2009-01-20T19:03:38.399+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='python'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><title type='text'>Today it is a good day to start learning 
Python</title><content type='html'>&lt;div style="text-align: justify;"&gt;I was just trying to solve a problem for &lt;a href="http://eldiegoj.blogspot.com"&gt;my brother&lt;/a&gt;, and I found Python really exciting. Of course, it doesn't seem to be as good as Perl for regular expressions, but on the other hand its modularization seems very clean, and it also has a very large set of libraries.&lt;br /&gt;&lt;br /&gt;Of course, I am not planning to move to Python, but I will keep an eye on that language. It could be a nice choice for building a quick prototype :).&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-2052280846121556802?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/2052280846121556802/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=2052280846121556802' title='2 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/2052280846121556802'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/2052280846121556802'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2009/01/today-it-is-good-day-to-start-learning.html' title='Today it is a good day to start learning Python'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-7699437531962978460</id><published>2009-01-03T17:38:00.002+01:00</published><updated>2009-01-03T17:44:58.742+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='software'/><category scheme='http://www.blogger.com/atom/ns#' term='svm'/><title type='text'>Liblinear is amazingly fast!</title><content type='html'>I decided to give up using LibSVM for the linear case because it was not optimized for that. Then, I had a look at &lt;a href="http://www.csie.ntu.edu.tw/%7Ecjlin/liblinear/"&gt;liblinear&lt;/a&gt;, developed by the same team as LibSVM. Liblinear is recommended for document classification because it removes many operations that are unnecessary in the linear case. It also has a (very recent) port to Java, which is located &lt;a href="http://www.bwaldvogel.de/liblinear-java/"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Now, working with 50 categories and 0.5GB of data takes less than 10 seconds on a Core 2 Duo 2GHz laptop. Those timings are impressive! The interface for this library is very similar to the LibSVM one, so it is very easy to migrate from one library to the other. 
Of course, if you made a good design, all you have to do is to update/change your corresponding &lt;a href="http://en.wikipedia.org/wiki/Facade_pattern"&gt;facade class&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;BTW: Happy new year!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-7699437531962978460?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/7699437531962978460/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=7699437531962978460' title='2 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/7699437531962978460'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/7699437531962978460'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2009/01/liblinear-is-amazingly-fast.html' title='Liblinear is amazingly fast!'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-6648681658922528956</id><published>2008-12-25T16:41:00.003+01:00</published><updated>2008-12-25T17:26:32.281+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='networks'/><category scheme='http://www.blogger.com/atom/ns#' term='science'/><title type='text'>Managing your informative networks</title><content type='html'>One of the main sources of "fresh information" in science 
is email, or more precisely, mailing lists. Depending on your area of expertise, you can find several mailing lists where it is possible to look for:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Interesting &lt;span style="font-weight: bold;"&gt;scientific discussions&lt;/span&gt; (like &lt;span style="font-style: italic;"&gt;where can I find the first reference to algorithm X? is there a better way to do this? what is the best source on this topic?&lt;/span&gt;).&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Postdoctoral positions:&lt;/span&gt; jobs that you can take if you already have a PhD and you want to make a lot of &lt;s&gt;money&lt;/s&gt; new exciting research for a short period (2 or 3 years).&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Special issues of scientific journals:&lt;/span&gt; maybe an opportunity to publish a paper in a highly regarded journal in your area, or at least to get good feedback on your research.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Conferences:&lt;/span&gt; the main way of getting information about new or established conferences, especially the important dates and the list of topics of interest.&lt;/li&gt;&lt;/ul&gt;As a researcher working between the fields of Information Retrieval and Machine Learning, I used to read the &lt;a href="http://tech.groups.yahoo.com/group/webir/"&gt;webir list&lt;/a&gt; to get the latest news from the IR field. Unfortunately, &lt;a href="http://einat.webir.org/"&gt;Einat Amitay&lt;/a&gt; stopped managing the list after 10 years (the list is now closed). Recently, following the advice of &lt;a href="http://decsai.ugr.es/%7Elci"&gt;my supervisor&lt;/a&gt;, I joined the &lt;a href="http://groups.google.com/group/ML-news?hl=en"&gt;ML-news&lt;/a&gt; list, which seems to be a very active forum in the area of Machine Learning, but I am still looking for a substitute for webir... 
Any good suggestions for a mailing list in the field of IR?&lt;br /&gt;&lt;br /&gt;On the other hand, what do you think of mailing lists? What lists do you belong to? Do you think email is very 90-ish? Do you trust Facebook groups more? Are you running a Machine Learning Twitter account? Feel free to answer, please.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-6648681658922528956?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/6648681658922528956/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=6648681658922528956' title='1 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/6648681658922528956'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/6648681658922528956'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2008/12/managing-your-informative-networks.html' title='Managing your informative networks'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-3193402394283994902</id><published>2008-12-10T16:07:00.005+01:00</published><updated>2008-12-11T18:22:19.077+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><title type='text'>Temporal files: let the OS manage them</title><content type='html'>&lt;div 
style="text-align: justify;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://hostit1.connectria.com/philrandolph/philrandolph.nsf/dx/programmer.jpg/$file/programmer.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 277px; height: 300px;" src="http://hostit1.connectria.com/philrandolph/philrandolph.nsf/dx/programmer.jpg/$file/programmer.jpg" alt="" border="0" /&gt;&lt;/a&gt;While working on several projects in text categorization, I split my software into different programs, each with a specific, non-overlapping purpose. Communication among the different (Java) programs was coordinated by several &lt;a href="http://perl.org/"&gt;Perl&lt;/a&gt; scripts, which made my prototyping faster. The basic "communication scheme" was the following: &lt;span style="font-style: italic;"&gt;the outputs of the previous programs are used as input for the current program&lt;/span&gt;. And of course, if some input that was supposed to be there was not found, execution was aborted at that point.&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;This kind of "design" can lead to wrong computations if one of the programs in the chain aborts, leaving a &lt;span style="font-weight: bold;"&gt;corrupted result file&lt;/span&gt; on the hard disk. Depending on how good you are at "error management" (one of the most neglected parts in the world of &lt;span style="font-style: italic;"&gt;software-cycle-development-for-publishing-before-that-bloody-deadline&lt;/span&gt;), this will require &lt;span style="font-weight: bold;"&gt;checking the integrity&lt;/span&gt; of the file beforehand. 
Some time ago, I decided to avoid these problems by using &lt;span style="font-weight: bold;"&gt;temporary files&lt;/span&gt; (read "temporary files" here as "handmade temporary files").&lt;br /&gt;&lt;br /&gt;If my program had to generate, for instance, a file called "reuters.index", it first wrote a reuters.tmp, and once the whole procedure had finished without any problem, the same program renamed the file to its final name. This scheme is as &lt;span style="font-weight: bold;"&gt;dangerous&lt;/span&gt; as the previous solution. Why? Because, again, your program can fail at an intermediate point, and the error will abort the current execution (which is good) but could corrupt future executions (because there is already a temporary file on disk that your program could mistake for its own). Moreover, this &lt;span style="font-weight: bold;"&gt;rules out running several executions in parallel&lt;/span&gt; (because all of them would write to the same temporary file, mixing their outputs). This can easily be avoided by adding a &lt;span style="font-weight: bold;"&gt;unique identifier&lt;/span&gt; to the file name (for example, the PID of the process). The PID (which in Perl is given by the "$$" variable) makes the names of the temporary files different and avoids the problems mentioned above.&lt;br /&gt;&lt;br /&gt;The only question left unanswered is "what happens to the temporary files of aborted executions". 
After my program had failed several times, I could find 10 or 12 reuters_XXXX.tmp files in my working directory, which were &lt;span style="font-weight: bold;"&gt;several hundred megabytes each&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;filled my working directory&lt;/span&gt; with nothing but &lt;span style="font-weight: bold;"&gt;trash.&lt;/span&gt; I found an "elegant" solution: at the beginning of my main script, I checked for the existence of temporary files and, if there were any, deleted them.&lt;br /&gt;&lt;br /&gt;If you have read this far and you are an experienced programmer, you may have thought that I am a novice. In fact, this kind of practice is &lt;span style="font-weight: bold;"&gt;reinventing the wheel&lt;/span&gt;. All modern operating systems (Linux, MacOS, Windows) support the creation and management of temporary files, avoiding the problems described above, and probably managing them better than we would. Moreover, most modern languages (Java, Perl) include functions to use temporary files as if they were "normal files", with a single instruction.&lt;br /&gt;&lt;br /&gt;For instance, &lt;span style="font-weight: bold;"&gt;in Java&lt;/span&gt;, we can write:&lt;br /&gt;&lt;br /&gt;File f = File.createTempFile(tempFileName, ".tmp");&lt;br /&gt;f.deleteOnExit();&lt;br /&gt;&lt;br /&gt;(obviously, the second statement is not compulsory if we do not want the file to be deleted). After that, we can dump our output to that file as usual, and finally rename it, or even open it to read it. 
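&lt;br /&gt;&lt;br /&gt;A minimal sketch of that write-then-rename flow could look as follows. The class name TempFileSketch, the method writeThenRename and the file names are hypothetical, chosen only for illustration, and creating the temporary file in the same directory as the final file is an assumption made here so that the rename stays on a single filesystem:&lt;br /&gt;&lt;pre&gt;
```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class TempFileSketch {

    // Writes contents to a uniquely named temporary file next to target,
    // then renames it to the final name once writing has succeeded.
    static File writeThenRename(File target, String contents) throws IOException {
        File dir = target.getAbsoluteFile().getParentFile();
        // createTempFile picks a unique name; the prefix needs at least 3 characters.
        File tmp = File.createTempFile("reuters", ".tmp", dir);
        // Safety net: if we abort before the rename, the JVM removes the file on normal exit.
        tmp.deleteOnExit();

        try (Writer out = new FileWriter(tmp)) {
            out.write(contents);
        }

        // renameTo can fail on some platforms if the target exists, so remove it first.
        if (target.exists()) {
            target.delete();
        }
        if (!tmp.renameTo(target)) {
            throw new IOException("could not rename " + tmp + " to " + target);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        File result = writeThenRename(new File(dir, "reuters.index"), "term 1\n");
        System.out.println("wrote " + result.getAbsolutePath());
    }
}
```
&lt;/pre&gt;The important detail is the order of the steps: the final name only appears once all the output has been written, so a later program in the chain should not find a half-written reuters.index. 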
The temporary file will be deleted when the program exits (in Java, deleteOnExit() asks the virtual machine to do this on normal termination), and choosing a unique name is also taken care of for us.&lt;br /&gt;&lt;br /&gt;On the other hand, &lt;span style="font-weight: bold;"&gt;in Perl&lt;/span&gt; it is as easy as in Java (the "UNLINK=&gt;1" option can be used or not):&lt;br /&gt;&lt;br /&gt;my ($fh, $filename) = tempfile(UNLINK =&gt; 1);&lt;br /&gt;&lt;br /&gt;Then, you can use the file handle as usual:&lt;br /&gt;&lt;br /&gt;print $fh "HOLA!\n";&lt;br /&gt;&lt;br /&gt;And do not forget to include the corresponding packages:&lt;br /&gt;&lt;br /&gt;use File::Temp qw/ tempfile tempdir /;&lt;br /&gt;&lt;br /&gt;The lesson learnt is:&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-weight: bold;"&gt;let the OS be the OS, and you be the scientist.&lt;/span&gt; You will write less code, and your probability of error will therefore be lower.&lt;/span&gt; Another way of looking at it could be "use all the power provided by your OS".&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-3193402394283994902?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/3193402394283994902/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=3193402394283994902' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/3193402394283994902'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/3193402394283994902'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2008/12/temporal-files-let-os-manage-them.html' 
title='Temporal files: let the OS manage them'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-4728465942885140565</id><published>2008-12-06T02:50:00.005+01:00</published><updated>2008-12-06T03:34:54.412+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><title type='text'>Make your Lexicon array-free!</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span style="font-style: italic;"&gt;The following short note is very programming-oriented. I apologize to my more theoretically minded readers.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In many &lt;a href="http://en.wikipedia.org/wiki/Information%20Retrieval"&gt;IR&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Text%20Categorization"&gt;TC&lt;/a&gt; applications, a common data structure for fast access to indexed documents is the &lt;span style="font-weight: bold;"&gt;Lexicon&lt;/span&gt;. A Lexicon is a set of terms (whatever a "term" may be) that are easily accessible by identifier or by string. Access by identifier has the clear advantage that it can be done in constant time, while access by string has to visit a binary (&lt;a href="http://en.wikipedia.org/wiki/Red-black_tree"&gt;red-black&lt;/a&gt;?) tree, taking logarithmic time at best (with some overhead, because the string has to be compared against the strings of all the nodes traversed). 
If we are given the class &lt;span style="font-style: italic;"&gt;Term&lt;/span&gt;, a common way to implement our lexicon could be the following (the implementation is given in Java, but it could easily be translated to C++ or C#):&lt;br /&gt;&lt;br /&gt;class Lexicon {&lt;br /&gt;  private Map&amp;lt;String, Integer&amp;gt; identifierByString;&lt;br /&gt;  private Term terms[];&lt;br /&gt;&lt;br /&gt;  ...&lt;br /&gt;&lt;br /&gt;  public Term getTermById(int i) { return terms[i]; }&lt;br /&gt;&lt;br /&gt;  public int size() { return terms.length; }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;This implementation, given the identifier &lt;span style="font-style: italic;"&gt;i&lt;/span&gt; of a term, makes access to the &lt;span style="font-style: italic;"&gt;i&lt;/span&gt;-th term easy (just by doing &lt;span style="font-style: italic;"&gt;terms[i]&lt;/span&gt;). This is, in my opinion, an &lt;span style="font-weight: bold;"&gt;error&lt;/span&gt; that takes flexibility away from your system. 
The three main drawbacks of the above design are the following:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It assumes that the system has "size()" terms, &lt;span style="font-weight: bold;"&gt;with identifiers going from 0 to size()-1&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;It does not allow the removal of a term&lt;/span&gt; (for example, if we are doing &lt;span style="font-style: italic;"&gt;term selection&lt;/span&gt;, a very common procedure in TC, we have to build a second lexicon, a "translation table", and translate the indexed documents to the new lexicon, in order to avoid "unused" term identifiers).&lt;/li&gt;&lt;li&gt;The way to "iterate" over the whole set of terms is very weak and implementation-dependent:&lt;/li&gt;&lt;/ul&gt;  for(int i=0; i&amp;lt;lexicon.size(); i++)&lt;br /&gt;  {&lt;br /&gt;    Term t = lexicon.getTermById(i);&lt;br /&gt;    ...&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;Of course, the first assumption may not hold in some preprocessed datasets (the term identifiers could start from 1, and there could be some "holes" in them due to a previous term selection).&lt;br /&gt;&lt;br /&gt;I propose the following, more flexible approach:&lt;br /&gt;&lt;br /&gt;class Lexicon {&lt;br /&gt;  private Map&amp;lt;String, Integer&amp;gt; identifierByString;&lt;br /&gt;  private HashMap&amp;lt;Integer, Term&amp;gt; terms;&lt;br /&gt;  ...&lt;br /&gt;  public Term getTermById(int i) { return terms.get(i); }&lt;br /&gt;  public Set&amp;lt;Integer&amp;gt; getTermIdentifiers() { return terms.keySet(); }&lt;br /&gt;&lt;br /&gt;  public int size() { return terms.size(); }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;The array has been replaced by a map (a HashMap, if possible), which also gives us constant-time access (with, of course, a bit of CPU overhead and additional memory usage).&lt;br /&gt;&lt;br /&gt;
Note the newly introduced method getTermIdentifiers(), whose purpose is to replace the "weak" way of iterating over the lexicon. Now, by doing&lt;br /&gt;&lt;br /&gt;  for(int term : lexicon.getTermIdentifiers())&lt;br /&gt;  {&lt;br /&gt;    Term t = lexicon.getTermById(term);&lt;br /&gt;    ...&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;using the &lt;a href="http://java.sun.com/j2se/1.5.0/docs/guide/language/foreach.html"&gt;foreach Java 5 loop&lt;/a&gt;, we are sure we are not making mistakes (for instance, using an invalid term id, which would cause a subsequent NullPointerException). Note that this version does not guarantee that the set of terms is traversed in increasing order. Never mind, because &lt;span style="font-weight: bold;"&gt;relying on getting terms by increasing id should also be avoided&lt;/span&gt; (because the id assigned to a term often depends on the order in which the documents were processed, and is therefore arbitrary). 
Remember: &lt;span style="font-weight: bold;"&gt;ignore underlying structures&lt;/span&gt; and program to an interface.&lt;br /&gt;&lt;br /&gt;Now, we can write a simple "removeTerm" method in just two lines (remember to be consistent and remove the term in both places!):&lt;br /&gt;&lt;br /&gt;  void removeTerm(int i)&lt;br /&gt;  {&lt;br /&gt;     identifierByString.remove(terms.get(i).getString());&lt;br /&gt;     terms.remove(i);&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;In conclusion, it is sometimes better to sacrifice time and memory if the resulting design is more robust and does not require deep knowledge of the underlying structure. 
In my opinion,&lt;span style="font-weight: bold;"&gt; arrays should always be avoided unless you are 100% sure that your data is a mapping from 0..n-1 to a certain set&lt;/span&gt;. A good starting strategy, as a &lt;span style="font-style: italic;"&gt;"rule of thumb"&lt;/span&gt;, can be the following: &lt;span style="font-weight: bold;"&gt;"if the contents are likely to change, be array-free; encapsulate containers, not arrays"&lt;/span&gt;.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-4728465942885140565?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/4728465942885140565/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=4728465942885140565' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/4728465942885140565'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/4728465942885140565'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2008/12/make-your-lexicon-array-free.html' title='Make your Lexicon array-free!'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-5325002603694668632</id><published>2008-12-05T00:49:00.000+01:00</published><updated>2008-12-05T02:06:27.732+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='text classification'/><category scheme='http://www.blogger.com/atom/ns#' term='svm'/><title type='text'>LibSVM integrated</title><content type='html'>&lt;div style="text-align: justify;"&gt;I have finally finished integrating LibSVM into my software. Trying to reproduce &lt;a href="http://joachims.org"&gt;Joachims'&lt;/a&gt; results on &lt;a href="http://www.daviddlewis.com/resources/testcollections/reuters21578/"&gt;reuters&lt;/a&gt; and &lt;a href="http://trec.nist.gov/data/t9_filtering.html"&gt;ohsumed-23&lt;/a&gt;, I got the following values for the &lt;a href="http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization"&gt;micro-averaged breakeven point&lt;/a&gt;:&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;On reuters:&lt;/span&gt; 84.2 (Joachims), 85.9 (me).&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;On ohsumed:&lt;/span&gt; 60.7 (Joachims), 64.8 (me).&lt;/li&gt;&lt;/ul&gt;(the reference paper is &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.8039"&gt;this one&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;The differences may be due to the stopword list (I used the famous &lt;a href="http://www.lextek.com/manuals/onix/stopwords2.html"&gt;571 words&lt;/a&gt; of the SMART system, which is almost a standard) and to my own preprocessing procedure (I remove &lt;span style="font-weight: bold;"&gt;all&lt;/span&gt; punctuation marks). 
The results are really good, but the large difference on ohsumed is mysterious...&lt;br /&gt;&lt;br /&gt;By the way, on my Core 2 Duo 2GHz, training with LibSVM takes 4m28s and classification 1m42s. It is the Java version, but it is still affordable. &lt;span style="font-style: italic;"&gt;Who said SVMs were slow?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In the coming days, I will try to improve my k-NN implementation (at the moment it has no &lt;a href="http://en.wikipedia.org/wiki/Inverted_index"&gt;inverted index&lt;/a&gt;, and so it is terrifyingly slow) and to include another Bayesian network classifier (Sahami's "limited dependence Bayesian classifier"), which I think could be improved in some way to make it competitive with SVMs.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-5325002603694668632?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/5325002603694668632/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=5325002603694668632' title='4 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/5325002603694668632'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/5325002603694668632'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2008/12/libsvm-integrated.html' title='LibSVM integrated'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-7046513654811267558</id><published>2008-12-04T14:13:00.001+01:00</published><updated>2008-12-05T19:00:46.067+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='software'/><category scheme='http://www.blogger.com/atom/ns#' term='svm'/><title type='text'>Novice problems with LibSVM</title><content type='html'>&lt;div style="text-align: justify;"&gt;If you are dealing with LibSVM, you must remember the following:&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;When building sparse vectors using the svm_node datatype, be careful to add the keys in &lt;span style="font-weight: bold;"&gt;ascending order&lt;/span&gt;. This is clearly specified in the documentation, but sometimes we are too lazy to read it first.&lt;/li&gt;&lt;li&gt;By default, the outputs of LibSVM when doing classification are one of &lt;span style="font-weight: bold;"&gt;{-1,1}&lt;/span&gt;. So, do not expect to get real-valued outputs (for instance, the distance to the hyperplane), unless you hack the code yourself. If you are doing text categorization, this is fine for measuring (macro/micro) F1, but not for getting good accuracy.&lt;/li&gt;&lt;li&gt;You must first &lt;span style="font-weight: bold;"&gt;preprocess your feature vectors!&lt;/span&gt; Joachims proposes using tf * idf weights, followed by an L2 normalization (the classical Euclidean norm). This is valid for text classification, and maps every coordinate value to the interval [0,1]. Other normalization schemes are valid for "classic" classification problems such as iris (in those cases, the different attributes are scaled independently to [0,1]).&lt;/li&gt;&lt;li&gt;There is a nasty bug (lack of feature?) 
in the Java version, in the method "svm_save_model", that makes that procedure very slow because the output is not buffered. To solve it, find this line:&lt;br /&gt;&lt;tt&gt;DataOutputStream fp = new DataOutputStream(new FileOutputStream(model_file_name));&lt;/tt&gt;&lt;br /&gt;And replace it with the following:&lt;br /&gt;&lt;tt&gt;DataOutputStream fp = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(model_file_name)));&lt;/tt&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-7046513654811267558?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/7046513654811267558/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=7046513654811267558' title='3 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/7046513654811267558'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/7046513654811267558'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2008/12/novice-problems-with-libsvm.html' title='Novice problems with LibSVM'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-4916784297200215059</id><published>2008-12-03T15:29:00.000+01:00</published><updated>2008-12-03T15:51:06.722+01:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='software'/><category scheme='http://www.blogger.com/atom/ns#' term='svm'/><title type='text'>Using a SVM library for text categorization</title><content type='html'>&lt;div style="text-align: justify;"&gt;In &lt;span style="font-weight: bold;"&gt;text categorization&lt;/span&gt;, one of my research fields, it is impossible to ignore the great power of &lt;a href="http://en.wikipedia.org/wiki/Support_Vector_Machine"&gt;Support Vector Machines&lt;/a&gt; (SVM). Almost everyone agrees that, even in their simplest form (linear SVM), they are the &lt;span style="font-style: italic;"&gt;killer&lt;/span&gt; algorithm for this task (the one that achieves the highest accuracy). And of course, that implies that any newly proposed approach to this problem &lt;span style="font-weight: bold;"&gt;should be tested against SVMs&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;But what (open source/free) software packages are available for Support Vector Machines? In fact, the list is quite short, and the two most popular ones are the following:&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;a href="http://svmlight.joachims.org/"&gt;SVM&lt;i&gt;&lt;sup&gt;light&lt;/sup&gt;&lt;/i&gt;&lt;/a&gt;, a C implementation written by &lt;a href="http://www.cs.cornell.edu/People/tj/"&gt;Thorsten Joachims&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.csie.ntu.edu.tw/%7Ecjlin/libsvm/"&gt;LibSVM&lt;/a&gt;, a C++ (and Java) implementation written by Chih-Chung Chang and &lt;a href="http://www.csie.ntu.edu.tw/%7Ecjlin/index.html"&gt;Chih-Jen Lin&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;Both of them are suitable for text categorization tasks, as they are lightweight and relatively fast implementations. Besides, both use decomposition methods to speed up training (LibSVM, in particular, uses an SMO-type solver in the spirit of Platt's &lt;a href="http://research.microsoft.com/%7Ejplatt/smo.html"&gt;SMO algorithm&lt;/a&gt;). 
So, which one should you choose?&lt;br /&gt;&lt;br /&gt;I have tested both of them. In &lt;a href="http://dx.doi.org/10.1016/j.ijar.2008.10.006"&gt;my most recent paper&lt;/a&gt;, we used SVM&lt;i&gt;&lt;sup&gt;light&lt;/sup&gt;&lt;/i&gt; as a baseline to compare against a Bayesian network model for classification in a thesaurus environment (and of course, we beat linear SVMs!). That package is &lt;span style="font-weight: bold;"&gt;amazingly fast&lt;/span&gt;, not only because of the language it is written in (C), but also because of Joachims' great work on heuristics and other tricks.&lt;br /&gt;&lt;br /&gt;At the moment, I am using LibSVM in my Java environment for text categorization (which I expect to release soon as free software), using the Java implementation directly. Although it is written in a very "C-style" (arrays instead of containers, static methods, no exceptions,...), its speed is not so bad (obviously it is several times slower than SVM&lt;i&gt;&lt;sup&gt;light&lt;/sup&gt;&lt;/i&gt;, but it is Java, which avoids linking against a non-portable library and keeps the entire system in one language).&lt;br /&gt;&lt;br /&gt;From the point of view of &lt;span style="font-weight: bold;"&gt;software licenses&lt;/span&gt;, LibSVM is released under the &lt;a href="http://en.wikipedia.org/wiki/BSD_licenses"&gt;modified BSD license&lt;/a&gt; (a GPL-compatible license). This is good, because it allows you to use this package even in a non-free software environment (I must admit the last point is not really so good). &lt;span style="font-weight: bold;"&gt; SVM&lt;/span&gt;&lt;i&gt;&lt;sup&gt;&lt;span style="font-weight: bold;"&gt;light&lt;/span&gt; &lt;/sup&gt;&lt;/i&gt;, on the other hand, &lt;span style="font-weight: bold;"&gt;is not free software&lt;/span&gt;. The license note states:&lt;br /&gt;&lt;blockquote style="font-style: italic;"&gt;&lt;p&gt;The program is free for scientific use. 
Please contact me, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author.&lt;sup&gt;&lt;/sup&gt;  &lt;/p&gt;  &lt;/blockquote&gt;This is an important fact for me, and that is why &lt;span style="font-weight: bold;"&gt;I prefer LibSVM&lt;/span&gt;. I must admit that Joachims' work is impressive, and SVM&lt;i&gt;&lt;sup&gt;light &lt;/sup&gt;&lt;/i&gt;is probably a faster and more complete environment, but I can cope with the lack of functionality and speed of &lt;span style="font-weight: bold;"&gt;LibSVM, because it is free software&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;What do you think about this?&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-4916784297200215059?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/4916784297200215059/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=4916784297200215059' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/4916784297200215059'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/4916784297200215059'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2008/12/using-svm-library-for-text.html' title='Using a SVM library for text categorization'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' 
src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-8691964767967241602</id><published>2008-12-02T15:00:00.000+01:00</published><updated>2008-12-02T15:20:17.904+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='science'/><title type='text'>H-Index and so</title><content type='html'>&lt;div style="text-align: justify;"&gt;The &lt;a href="http://en.wikipedia.org/wiki/H%20Index"&gt;Hirsch Index&lt;/a&gt; (or, for short, the h-index) is a way to measure scientific popularity with a single number. A scientist with an h-index of &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; has &lt;span style="font-style: italic;"&gt;n &lt;/span&gt;papers with at least &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; citations each. Note that I am saying that it measures "scientific popularity", not "scientific excellence" or "scientific quality" (although there is often a correlation between the three).&lt;br /&gt;&lt;br /&gt;The h-index is sometimes a means of &lt;span style="font-style: italic;"&gt;self-glorification&lt;/span&gt;, and other times it hides a collaborative &lt;span style="font-style: italic;"&gt;mafia&lt;/span&gt; ("I will cite you if you cite me"), but I like it (and so does the scientific community). It is only a &lt;span style="font-style: italic;"&gt;metric&lt;/span&gt;, but, like other metrics, it is a &lt;span style="font-style: italic;"&gt;quantitative&lt;/span&gt; way to measure the importance of a scientist in his or her community.&lt;br /&gt;&lt;br /&gt;As you can see, the h-index gets &lt;span style="font-weight: bold;"&gt;harder to raise with every step&lt;/span&gt;: when you get your first citation, you get your &lt;span style="font-style: italic;"&gt;h=1&lt;/span&gt;. 
To get h=2 you need either (at least) one more citation on that paper and (at least) two on a different one, or (at least) two citations on each of two other papers. That means that stepping from an h-index of &lt;span style="font-style: italic;"&gt;n&lt;/span&gt;-2 to &lt;span style="font-style: italic;"&gt;n-1 &lt;/span&gt;is easier than doing the same from &lt;span style="font-style: italic;"&gt;n-1&lt;/span&gt; to &lt;span style="font-style: italic;"&gt;n&lt;/span&gt; (because in every step you need more and more citations).&lt;br /&gt;&lt;br /&gt;A helpful tool to compute this index is "Publish or Perish", available &lt;a href="http://www.harzing.com/pop.htm"&gt;here&lt;/a&gt;. Using &lt;a href="http://scholar.google.com"&gt;Google Scholar&lt;/a&gt; and other similar services, this tool can easily compute the h-index of a researcher in a semi-automatic way (you often have to manually discard several publications that do not come from that author and, of course, self-citations).&lt;br /&gt;&lt;br /&gt;By the way, as of today (2/12/2008) my h-index is 1. 
It is not so bad for a PhD student, but I hope it will improve next year...&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-8691964767967241602?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/8691964767967241602/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=8691964767967241602' title='1 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/8691964767967241602'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/8691964767967241602'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2008/12/h-index-and-so.html' title='H-Index and so'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-986368889047854119.post-2793960189255235524</id><published>2008-12-02T14:32:00.001+01:00</published><updated>2008-12-02T14:44:01.489+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='self.blog'/><title type='text'>What do I understand by "a professional blog"?</title><content type='html'>&lt;div style="text-align: justify;"&gt;Hi everyone!&lt;br /&gt;&lt;br /&gt;As a &lt;a href="http://decsai.ugr.es/%7Eaeromero"&gt;&lt;span style="font-weight: bold;"&gt;researcher&lt;/span&gt;&lt;/a&gt; in the field of &lt;a 
href="http://en.wikipedia.org/wiki/Computer%20Science"&gt;Computer Science&lt;/a&gt;, I have decided to open a "professional" blog. What I mean by "professional" is that &lt;span style="font-weight: bold;"&gt;all the contents of this blog are going to be exclusively related to my research and current job&lt;/span&gt; (I am a funded student at the &lt;a href="http://www.ugr.es"&gt;University of Granada&lt;/a&gt;, Spain, in the &lt;a href="http://decsai.ugr.es"&gt;Department of Computer Science and Artificial Intelligence&lt;/a&gt;). So, the posts of &lt;a href="http://alfonsoeromero.blogspot.com"&gt;this blog&lt;/a&gt; could include:&lt;br /&gt;&lt;/div&gt;&lt;ul style="text-align: justify;"&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Publications&lt;/span&gt; I have written, with some personal comments.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Conferences&lt;/span&gt; I have attended and/or spoken at (that could also include some interesting pics!).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Interesting papers&lt;/span&gt; I have read (also with personal comments).&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Software&lt;/span&gt; that I have released, or software that I have found interesting or useful.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Random thoughts&lt;/span&gt; (i.e. opinions) on Computer Science, Information Retrieval, Machine Learning, or science in general.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Interesting links&lt;/span&gt; (coming from &lt;a href="http://reddit.com"&gt;interesting sources&lt;/a&gt;).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Miscellaneous&lt;/span&gt; information (about this blog, or about me, always related to my job).&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;This is what I understand by a "professional" blog. 
Other suggestions are also welcome.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/986368889047854119-2793960189255235524?l=alfonsoeromero.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alfonsoeromero.blogspot.com/feeds/2793960189255235524/comments/default' title='Enviar comentarios'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=986368889047854119&amp;postID=2793960189255235524' title='0 comentarios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/2793960189255235524'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/986368889047854119/posts/default/2793960189255235524'/><link rel='alternate' type='text/html' href='http://alfonsoeromero.blogspot.com/2008/12/what-do-i-understand-by-professional.html' title='What do I understand by &quot;a professional blog&quot;?'/><author><name>Alfonso E.</name><uri>http://www.blogger.com/profile/18003152267896724194</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_I_EljbfWDNQ/SLa8uB8XfCI/AAAAAAAAAMw/eDpXIrUTWO4/S220/reich.jpg'/></author><thr:total>0</thr:total></entry></feed>
