Tuesday 18 November 2008

Lucene Analyzers

These are the conclusions of several benchmarks of different document analyzers with Lucene.

Tested analyzers:
(If you have no idea about Information Retrieval metrics, look here.)

For the precision metric, the best analyzer was the Standard Analyzer.
For the recall metric, the best analyzer was the Double Metaphone Analyzer.
For the R-Precision and K-Precision metrics, the best was the Snowball Analyzer with stop words.
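
To make the metrics concrete, here is a minimal sketch of how they can be computed for one query. It is not the code I used for the benchmarks: the class name IrMetrics and the document ids are made up for illustration, I am reading "K-Precision" as precision over the first K results, and the ranked list is assumed to contain no duplicate ids.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class IrMetrics {

    // Precision at K: fraction of the first k positions that hold a relevant document.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            return 0.0;
        }
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    // Precision: fraction of all retrieved documents that are relevant.
    static double precision(List<String> ranked, Set<String> relevant) {
        return precisionAtK(ranked, relevant, ranked.size());
    }

    // Recall: fraction of all relevant documents that were retrieved.
    static double recall(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String doc : ranked) {
            if (relevant.contains(doc)) {
                hits++;
            }
        }
        return (double) hits / relevant.size();
    }

    // R-Precision: precision at R, where R is the number of relevant documents.
    static double rPrecision(List<String> ranked, Set<String> relevant) {
        return precisionAtK(ranked, relevant, relevant.size());
    }

    public static void main(String[] args) {
        // Made-up search result and relevance judgments for a single query.
        List<String> ranked = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d5"));

        System.out.println("precision   = " + precision(ranked, relevant));
        System.out.println("recall      = " + recall(ranked, relevant));
        System.out.println("R-precision = " + rPrecision(ranked, relevant));
        System.out.println("P@2         = " + precisionAtK(ranked, relevant, 2));
    }
}
```

In a benchmark you would average each value over all the test queries before comparing analyzers.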

If you are developing an information retrieval system with Lucene, I recommend using the Snowball Analyzer with a good set of stop words. It will return better results to your users.

Saturday 27 September 2008

Phonetix

When you want to build a useful search engine for a set of documents, you have to be careful with the analyzer you choose. In Lucene you can use standard analyzers like StandardAnalyzer (=P), simpler ones like SimpleAnalyzer (2 * =P), more advanced ones like SnowballAnalyzer, or heuristic algorithms like Soundex.

Soundex is an old heuristic developed by Robert Russell and Margaret Odell to match words that are spelled differently but sound alike. This is useful when the user makes a mistake in the query. Lawrence Philips improved on Soundex with Metaphone, which adds more rules, and he improved it again with Double Metaphone.
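
To give an idea of how such a heuristic works, here is a minimal sketch of the classic American Soundex rules. It is my own illustration, not the Phonetix code, and the class name SoundexSketch is made up. It keeps the first letter, turns the following consonants into digits, skips vowels, collapses runs of the same digit (also across H and W), and pads the result to four characters.

```java
// Hypothetical illustration of American Soundex; not the Phonetix implementation.
final class SoundexSketch {

    // Codes for 'A'..'Z'. '0' marks vowels and Y, '-' marks H and W.
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a plain Latin letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);   // the first letter is kept as a letter
                lastCode = code; // but its code still suppresses a repeat
            } else if (code == '-') {
                // H and W are skipped and do not break a run of equal codes
            } else if (code == '0') {
                lastCode = '0';  // a vowel breaks the run
            } else {
                if (code != lastCode) {
                    out.append(code);
                }
                lastCode = code;
            }
        }
        // Pad short codes with zeros; an input without letters stays empty.
        while (out.length() > 0 && out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // R163
        System.out.println(encode("Rupert")); // R163, same code as Robert
    }
}
```

Two words match when their codes are equal, so a query for "Rupert" would also find documents containing "Robert". Metaphone and Double Metaphone follow the same idea with many more pronunciation rules.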

These days I'm experimenting with Information Retrieval systems, specifically with Lucene. I'm using it for a university course. I wanted to experiment with a heuristic phonetic analyzer, but to my surprise there is no analyzer of this kind in the standard set of Lucene libraries. After a few minutes with Google I found Phonetix, which implements Soundex, Metaphone and Double Metaphone, plus a wrapper to use them with Lucene. I tried it with Lucene 2.3.2, but I found that the library hadn't been updated for this version of Lucene. After a few minutes of work I updated the Phonetix code a little, and now it works perfectly with Lucene =D.
I sent the updated code to the owner, and I hope they publish it soon.

As for the experiments, Double Metaphone is giving me good results. I will talk more about them in another post.

Tuesday 2 September 2008

Google Chrome (YAWB)

Yet Another Web Browser.

Yeah, another big compatibility issue for web-based systems.
Are those sounds of web developers crying?

But we shouldn't be pessimistic. Google Chrome was built on WebKit, the same rendering engine that Safari uses. So if your site looks good in Safari, it should look good in Chrome (this will also help Safari get more compatible sites). Chrome uses a sandbox for each tab, so you won't have the problems that IE and Firefox have when a tab gets busy executing JavaScript.
They developed V8, a "new JavaScript engine". A short time ago, the Mozilla people said that the new release of Firefox (3.1) speeds up JavaScript execution. I wonder which will be better. This is important because more and more applications run on the client side, on top of the JavaScript engine.

I used it for a couple of hours and found just three problems. It freezes for a few seconds when you open several tabs with YouTube videos and try to jump from one tab to another. When you open a lot of tabs and go from one to another, Chrome takes a little time to reload, maybe because it loads the page data from the hard disk. When you scroll a large page with a lot of images, it doesn't refresh the page quickly enough to go unnoticed. Apart from that, it is really a good tool and a great initial release.

PS: Do you remember when the Google people said that they wouldn't develop a new web browser?

Monday 25 August 2008

Another blog ...

Yes, another blog. I know, I know ... there are too many blogs. For each topic you can find thousands of blogs talking about everything. Why another blog? Well, sometimes I have an idea and I forget it in seconds, and if I publish my ideas, people's opinions could help to make them real.

The updates to this blog will be very rare and uncommon ...