Registering a custom analyzer for phonetic search in Neo4j 4

11 March 2021 · 2 min read

Phonetic matching attempts to match words by pronunciation instead of spelling. Words are often misspelled, and an exact match then fails to find them. Algorithms such as Soundex and Metaphone were developed to address this problem. They are used in voice assistants, search, record linkage, fraud detection, and the matching of misspelled names (for example, drug names in medical records, as in the reference below).
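To make the idea concrete, here is a minimal sketch of American Soundex in plain Java. It is for illustration only: the analyzer built later in this post relies on Lucene's Double Metaphone, not Soundex, and the class name Soundex and its encode method are names made up for this example.

```java
import java.util.Locale;

// Illustrative American Soundex encoder (hypothetical helper, not part of Neo4j or Lucene).
public final class Soundex {

    // Soundex digit for each letter A..Z; '0' means the letter is not coded (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    private Soundex() {
    }

    public static String encode(String word) {
        StringBuilder out = new StringBuilder(4);
        char lastCode = '0';
        for (char letter : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (letter < 'A' || letter > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(letter - 'A');
            if (out.length() == 0) {
                out.append(letter); // the first letter is kept as-is
                lastCode = code;
                continue;
            }
            if (letter == 'H' || letter == 'W') {
                continue; // H and W do not separate letters that share a code
            }
            if (code != '0' && code != lastCode) {
                out.append(code);
            }
            lastCode = code; // a vowel resets this, so a repeated code after a vowel is emitted again
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0'); // pad short codes to one letter plus three digits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A correct and a misspelled spelling of the same word
        System.out.println(encode("Financial") + " " + encode("fynansial")); // prints "F552 F552"
    }
}
```

Both spellings collapse to the same code, so a lookup keyed on the code finds the word despite the misspelling. Double Metaphone follows the same principle with more elaborate pronunciation rules.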

Custom analyzers

In 2019, we blogged about creating a Czech analyzer to address accents in the language.

With Neo4j 4, a few things have changed. This short blog post was inspired by a Stack Overflow question on phonetic searches, which led me to work out what has to change to register an analyzer in Neo4j 4.

First, we create our phonetic analyzer by extending org.neo4j.graphdb.schema.AnalyzerProvider. The @Service.Implementation annotation used previously has been replaced by a plain @ServiceProvider. This particular implementation simply wraps a StandardTokenizer in Lucene's DoubleMetaphoneFilter.

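import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.phonetic.DoubleMetaphoneFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.neo4j.annotations.service.ServiceProvider;
import org.neo4j.graphdb.schema.AnalyzerProvider;

@ServiceProvider
public class PhoneticAnalyzer extends AnalyzerProvider {

	// Maximum length of the phonetic code generated for each token
	public static final int MAX_CODE_LENGTH = 6;

	public PhoneticAnalyzer() {
		// Name used to select this analyzer when creating a full-text index
		super("phonetic");
	}

	@Override
	public String description() {
		return "Phonetic analyzer using the DoubleMetaphoneFilter";
	}

	@Override
	public Analyzer createAnalyzer() {
		return new Analyzer() {
			@Override
			protected TokenStreamComponents createComponents(String fieldName) {
				Tokenizer tokenizer = new StandardTokenizer();
				// inject = true keeps the original token alongside its phonetic codes
				TokenStream stream = new DoubleMetaphoneFilter(tokenizer, MAX_CODE_LENGTH, true);
				return new TokenStreamComponents(tokenizer, stream);
			}
		};
	}
}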

Pretty simple. Package the class into a jar and put it into Neo4j’s plugins directory along with Lucene’s phonetic jar. Restart the server, then verify that the new analyzer is registered by inspecting the results of CALL db.index.fulltext.listAvailableAnalyzers(). The phonetic analyzer should be listed.

Now, as before, create an index using the new analyzer:

CALL db.index.fulltext.createNodeIndex('jobs', ['Job'], ['name'], {analyzer:'phonetic'})

And query in the same manner:

CALL db.index.fulltext.queryNodes('jobs', 'fynansial') 

╒══════════════════════════════════╤══════════════════╕
│"node"                            │"score"           │
╞══════════════════════════════════╪══════════════════╡
│{"name":"Financial Administrator"}│0.2163023203611374│
└──────────────────────────────────┴──────────────────┘

This code is available on GitHub.

References:

Tissot, H., Dobson, R. Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese. J Biomed Semant 10, 17 (2019). https://doi.org/10.1186/s13326-019-0216-2