Saturday, 27 September 2008

Phonetix

If you want a useful search engine for a set of documents, you have to choose your analyzer carefully. In Lucene you can use a standard analyzer like StandardAnalyzer (=P), a simpler one like SimpleAnalyzer (2 * =P), a more advanced one like SnowballAnalyzer, or one based on a phonetic heuristic like Soundex.

Soundex is an old heuristic developed by Robert Russell and Margaret Odell to match words that are spelled differently but sound alike, which is useful when the user misspells a word in the query. Lawrence Philips improved on Soundex with Metaphone, which adds more rules, and later improved on it again with Double Metaphone.
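To make the idea concrete, here is a minimal sketch of the classic American Soundex rules in plain Java. It is not the Phonetix code or anything from Lucene; the class name SoundexSketch and the method encode are names I made up for illustration, and it assumes plain ASCII input. The first letter is kept, the remaining consonants are mapped to digits, vowels are dropped, repeated codes are collapsed, and the result is padded to four characters.

```java
// A hypothetical, minimal American Soundex encoder (illustration only,
// not the Phonetix or Lucene implementation).
final class SoundexSketch {

    // Code for each letter A..Z: digits for consonant groups,
    // '0' for vowels and Y, '-' for H and W.
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = 0; // code of the previous significant letter
        for (int i = 0; i < word.length(); i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // assumption: anything outside A..Z is ignored
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (code == '-') {
                continue; // H and W do not separate equal codes
            }
            if (code != '0' && code != last) {
                out.append(code);
                if (out.length() == 4) {
                    break;
                }
            }
            last = code; // a vowel resets 'last', so a repeated code counts again
        }
        if (out.length() == 0) {
            return ""; // no letters in the input
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // prints R163
        System.out.println(encode("Rupert")); // prints R163
    }
}
```

Both names get the same code, so a query for one would match the other if the index stored the codes instead of the raw terms. That is the whole trick behind a phonetic analyzer.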

These days I'm experimenting with Information Retrieval systems, specifically with Lucene, which I'm using for a university course. I wanted to run some experiments with a phonetic analyzer, but to my surprise there is no analyzer of this kind in Lucene's standard set of libraries. After a few minutes with Google I found Phonetix, which implements Soundex, Metaphone and Double Metaphone and includes a wrapper for using them with Lucene. I tried it with Lucene 2.3.2 but found that the library hadn't been updated for this version. After a few minutes of work I had refreshed the Phonetix code a little, and now it works perfectly with Lucene =D.
I sent the updated code to the owner, and I hope they publish it soon.

As for the experiments, Double Metaphone is giving me good results. I'll talk more about them in another post.

2 comments:

Chris said...

Hi Jonathan,

Thanks for the useful post. Would you mind posting or sending me the changes that you made to get Phonetix to work with Lucene 2? They don't appear to have made these available in the download.

Many thanks,
Chris Peacock.

Jonathan Barbero said...

Hey Chris,

Sorry for responding so late. I moderate comments and I didn't realize that somebody had commented. I have no problem sending you the changes. Look for me on LinkedIn or let me know how to get your email.

Regards,
Jonathan.