Showing posts with label information retrieval. Show all posts

Tuesday, 18 November 2008

Lucene Analyzers

These are the conclusions from several benchmarks of different document analyzers with Lucene.

Tested analyzers:
(If you are not familiar with information retrieval metrics, look here.)

For the precision metric, the best analyzer was the Standard Analyzer.
For the recall metric, the best analyzer was the Double Metaphone Analyzer.
For the R-Precision and K-Precision metrics, the best was the Snowball Analyzer with stop words.
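
To make the four metrics concrete, here is a minimal sketch of how they can be computed for a single query. It is not the code used in the benchmarks: the class name IrMetrics and the document ids are made up for illustration, I assume "K-Precision" means precision over the first K results, and I assume the ranked list contains no duplicate ids.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative helper (hypothetical, not part of Lucene) that scores one ranked result list. */
final class IrMetrics {
    private IrMetrics() {}

    /** Counts the relevant documents among the first k results. */
    static int relevantInTopK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return hits;
    }

    /** Precision: fraction of the returned documents that are relevant. */
    static double precision(List<String> ranked, Set<String> relevant) {
        if (ranked.isEmpty()) {
            return 0.0;
        }
        return (double) relevantInTopK(ranked, relevant, ranked.size()) / ranked.size();
    }

    /** Recall: fraction of the relevant documents that were returned. */
    static double recall(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) relevantInTopK(ranked, relevant, ranked.size()) / relevant.size();
    }

    /** Precision at K: precision measured over the first k results only. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            return 0.0;
        }
        return (double) relevantInTopK(ranked, relevant, k) / k;
    }

    /** R-Precision: precision at R, where R is the number of relevant documents. */
    static double rPrecision(List<String> ranked, Set<String> relevant) {
        return precisionAtK(ranked, relevant, relevant.size());
    }

    public static void main(String[] args) {
        // Made-up ids: the engine returned five documents, three are known to be relevant.
        List<String> ranked = Arrays.asList("d3", "d7", "d1", "d9", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d5"));

        System.out.println("precision   = " + precision(ranked, relevant));
        System.out.println("recall      = " + recall(ranked, relevant));
        System.out.println("P@2         = " + precisionAtK(ranked, relevant, 2));
        System.out.println("R-Precision = " + rPrecision(ranked, relevant));
    }
}
```

Precision and recall ignore the order of the results, while R-Precision and precision at K only look at the top of the ranking, which is usually what a user actually reads.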

If you are developing an information retrieval system with Lucene, I recommend using the Snowball Analyzer with a good set of stop words. In these benchmarks it gave the best R-Precision and K-Precision, so it should return better results to your users.

Saturday, 27 September 2008

Phonetix

When you want to build a useful search engine for a set of documents, you have to take care with the analyzer you choose. In Lucene you can use standard analyzers like StandardAnalyzer (=P), simpler ones like SimpleAnalyzer (2 * =P), more advanced ones like SnowballAnalyzer, or heuristic algorithms like Soundex.

Soundex is an old heuristic developed by Robert Russell and Margaret Odell to match words that are spelled differently but sound alike, which is useful when the user misspells a word in the query. Lawrence Philips later improved on Soundex with Metaphone, which uses a larger set of rules, and then improved on it again with Double Metaphone.
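
To give an idea of how such a heuristic works, here is a minimal sketch of the classic American Soundex rules in plain Java. It is not the Phonetix code, and the class name SimpleSoundex is made up for illustration. It keeps the first letter, maps the following consonants to digits, collapses adjacent letters with the same digit, and pads the code to four characters.

```java
/** Illustrative American Soundex encoder (hypothetical class, not taken from Phonetix). */
final class SimpleSoundex {
    // Digit for each letter A..Z; '0' marks letters that produce no digit (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    private SimpleSoundex() {}

    /** Returns the 4-character Soundex code, or "" if the word has no ASCII letters. */
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                lastCode = code;
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate two letters with the same digit
            }
            if (code != '0' && code != lastCode) {
                out.append(code);
            }
            lastCode = code; // a vowel resets lastCode, so a repeated digit after it is kept
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two different spellings that sound alike get the same code.
        System.out.println(encode("Robert")); // R163
        System.out.println(encode("Rupert")); // R163
    }
}
```

Indexing the code instead of the raw token is what lets a query for "Rupert" match a document containing "Robert". That tends to help recall, although it can also match words the user did not mean.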

These days I'm experimenting with information retrieval systems, specifically with Lucene, which I'm using for a university course. I wanted to run experiments with a heuristic phonetic analyzer, but to my surprise there is no analyzer of this kind in Lucene's standard set of libraries. After a few minutes with Google I found Phonetix, which implements Soundex, Metaphone and Double Metaphone and includes a wrapper for using them with Lucene. I tried it with Lucene 2.3.2 but found that the library had not been updated for this version. After a few minutes of work I had refreshed the Phonetix code a little, and it now works perfectly with Lucene =D.
I sent the updated code to the owner, and I hope they will publish it soon.

As for the experiments, Double Metaphone is giving me good results. I will talk more about them in another post.