When you want to build a useful search engine for a set of documents, you have to be careful with the analyzer you choose. In Lucene you can use standard analyzers like StandardAnalyzer (=P), simpler ones like SimpleAnalyzer (2 * =P), more advanced ones like SnowballAnalyzer, or heuristic algorithms like Soundex.
Soundex is an old heuristic developed by Robert Russell and Margaret Odell to match words that are spelled differently but sound alike, which is useful when the user makes a spelling mistake in the query. Soundex was later improved on by Metaphone, developed by Lawrence Philips, which adds more rules, and Philips improved on it again with Double Metaphone.
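To make the idea concrete, here is a minimal sketch of the classic American Soundex rules in plain Java. It is only an illustration, not the code used by Phonetix or Lucene; the class name SoundexSketch and the method soundex are names I made up for this example. Each word keeps its first letter, the following consonants are replaced by digits, repeated digits are collapsed, and the result is padded or cut to four characters.

```java
import java.util.Locale;

// Illustrative sketch only: SoundexSketch and soundex() are hypothetical names,
// not part of Lucene or Phonetix.
class SoundexSketch {
    // Digit for each letter A..Z; '0' marks letters that get no digit
    // (the vowels plus H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char last = '0'; // code of the previous letter that counts for duplicate detection
        for (char raw : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not a plain ASCII letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
            } else {
                if (code != '0' && code != last) {
                    out.append(code); // new consonant group
                }
                if (raw != 'H' && raw != 'W') {
                    last = code; // H and W do not separate equal codes, vowels do
                }
            }
            if (out.length() == 4) {
                break; // a Soundex code is one letter plus three digits
            }
        }
        if (out.length() == 0) {
            return ""; // no letters at all
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A misspelled query term gets the same code as the indexed term.
        System.out.println(soundex("Lucene") + " " + soundex("Lusene"));
        System.out.println(soundex("Robert") + " " + soundex("Rupert"));
    }
}
```

A phonetic analyzer applies an encoding like this to every token, both when indexing and when parsing the query, so "Lusene" and "Lucene" end up as the same term (L250 in this sketch). Metaphone and Double Metaphone follow the same idea with a much larger rule set.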
Lately I've been experimenting with Information Retrieval systems, specifically with Lucene, which I'm using for a university course. I wanted to run some experiments with a heuristic phonetic analyzer but, to my surprise, there is no analyzer of this kind in Lucene's standard set of libraries. After a few minutes with Google I found Phonetix, which implements Soundex, Metaphone and Double Metaphone and includes a wrapper for using them with Lucene. I tried it with Lucene 2.3.2, but found that the library hadn't been updated for this version of Lucene. After a few minutes of work I had updated the Phonetix code a little, and now it works perfectly with Lucene =D.
I sent the updated code to the owner, and I hope they will publish it soon.
As for the experiments, Double Metaphone is giving me good results. I will talk more about them in another post.