back to basics

Comparing Text using Soundex, PHONIC_ENCODE and FUZZY_MATCH

Posted on

When comparing text strings we have a number of functions on Oracle Database to help us. These include SOUNDEX, PHONIC_ENCODE and FUZZY_MATCH. Let’s have a look at what each of these can do.

The SOUNDEX function returns a character string containing the phonetic representation of a string. SOUNDEX lets you compare words that are spelled differently, but sound alike in English. To illustrate this, let’s compare some commonly spelled words and their variations, for example McDonald, MacDonald, and Smith and Smyth.

select soundex('MCDONALD') from dual;

SOUNDEX('MCDONALD') 
___________________ 
M235                

SQL> select soundex('MACDONALD') from dual;

SOUNDEX('MACDONALD') 
____________________ 
M235                 

SQL> pause  next_smith_smyth
 next_smith_smyth


SQL> 
SQL> select soundex('SMITH') from dual;

SOUNDEX('SMITH') 
________________ 
S530             

SQL> select soundex('SMYTH') from dual;

SOUNDEX('SMYTH') 
________________ 
S530             

Here we get to see SOUNDEX function returns the same code for each of the spelling variations. This function can be easily used to search for and to compare text in our tables.

Now let’s have a look at some of the different ways to spell Brendan.

select soundex('Brendan'), 
  2         soundex('BRENDAN'), 
  3  	   soundex('Breandan'),
  4  	   soundex('Brenden'),
  5  	   soundex('Brandon'),
  6  	   soundex('Brendawn'),
  7  	   soundex('Bhenden'),
  8  	   soundex('Brendin'),
  9  	   soundex('Brendon'),
 10  	   soundex('Beenden'),
 11  	   soundex('Breenden'),
 12  	   soundex('Brendin'), 
 13  	   soundex('Brendyn'), 
 14  	   soundex('Brandon'), 
 15  	   soundex('Brenainn'), 
 16  	   soundex('Bréanainn')
 17  from dual;

SOUNDEX('BRENDAN') SOUNDEX('BRENDAN') SOUNDEX('BREANDAN') SOUNDEX('BRENDEN') SOUNDEX('BRANDON') SOUNDEX('BRENDAWN') SOUNDEX('BHENDEN') SOUNDEX('BRENDIN') SOUNDEX('BRENDON') SOUNDEX('BEENDEN') SOUNDEX('BREENDEN') SOUNDEX('BRENDIN') SOUNDEX('BRENDYN') SOUNDEX('BRANDON') SOUNDEX('BRENAINN') SOUNDEX('BRÉANAINN') 
__________________ __________________ ___________________ __________________ __________________ ___________________ __________________ __________________ __________________ __________________ ___________________ __________________ __________________ __________________ ___________________ ____________________ 
B653               B653               B653                B653               B653               B653                B535               B653               B653               B535               B653                B653               B653               B653               B655                B655                 

We can see which variations of my name can be labeled as being similar in sound.

An alternative function is to use the PHONIC_ENCODE. The PHONIC_ENCODE function converts text into language-specific codes based on the pronunciation of the text. It implements the Double Metaphone algorithm and an alternative algorithm. DOUBLE_METAPHONE returns the primary code. DOUBLE_METAPHONE_ALT returns the alternative code if present. If the alternative code is not present, it returns the primary code.

select col, phonic_encode(double_metaphone,col) double_met
  2  from ( values 
  3  	    ('Brendan'), ('BRENDAN'), ('Breandan'), ('Brenden'), ('Brandon'),
  4  		('Brendawn'), ('Bhenden'), ('Brendin'), ('Brendon'), ('Beenden'),
  5  		('Breenden'), ('Brendin'), ('Brendyn'), ('Brandon'), ('Brenainn'), 
  6  		('Bréanainn')
  7       ) names (col);

COL       DOUBLE_MET 
_________ __________ 
Brendan   PRNT       
BRENDAN   PRNT       
Breandan  PRNT       
Brenden   PRNT       
Brandon   PRNT       
Brendawn  PRNT       
Bhenden   PNTN       
Brendin   PRNT       
Brendon   PRNT       
Beenden   PNTN       
Breenden  PRNT       
Brendin   PRNT       
Brendyn   PRNT       
Brandon   PRNT       
Brenainn  PRNN       
Bréanainn PRNN       

The final function we’ll look at is FUZZY_MATCH. The FUZZY_MATCH function is language-neutral. It determines the similarity between two strings and supports several algorithms. Here is a simple example, again using variations of Brendan

with names (col) as ( 
  2      values 
  3      ('Brendan'), ('BRENDAN'), ('Breandan'), ('Brenden'), ('Brandon'),
  4  	('Brendawn'), ('Bhenden'), ('Brendin'), ('Brendon'), ('Beenden'),
  5  	('Breenden'), ('Brendin'), ('Brendyn'), ('Brandon'), ('Brenainn'), 
  6  	('Bréanainn') )
  7  select a.col, b.col, fuzzy_match(levenshtein, a.col, b.col) as lev
  8  from names a, names b
  9  where a.col != b.col;

COL      COL       LEV 
________ _________ ___ 
Brendan  BRENDAN    15 
Brendan  Breandan   88 
Brendan  Brenden    86 
Brendan  Brandon    72 
Brendan  Brendawn   88 
Brendan  Bhenden    72 
Brendan  Brendin    86 
Brendan  Brendon    86 
Brendan  Beenden    72 
Brendan  Breenden   75 
Brendan  Brendin    86 
Brendan  Brendyn    86 
Brendan  Brandon    72 
Brendan  Brenainn   63 
Brendan  Bréanainn  45
...

Only a portion of the output is shown above. Similar solution would be the following and additionally we can compare the outputs from a number of the algorithms.

with names (col) as ( 
      values 
      ('Brendan'), ('BRENDAN'), ('Breandan'), ('Brenden'), ('Brandon'),
  	('Brendawn'), ('Bhenden'), ('Brendin'), ('Brendon'), ('Beenden'),
  	('Breenden'), ('Brendin'), ('Brendyn'), ('Brandon'), ('Brenainn'), 
  	('Bréanainn') )
  select a.col as col1, b.col as col2, 
         fuzzy_match(levenshtein, a.col, b.col) as lev,
         fuzzy_match(jaro_winkler, a.col, b.col) as jaro_winkler,
         fuzzy_match(bigram, a.col, b.col) as bigram,
         fuzzy_match(trigram, a.col, b.col) as trigram,
         fuzzy_match(whole_word_match, a.col, b.col) as jwhole_word,
         fuzzy_match(longest_common_substring, a.col, b.col) as lcs
  from names a, names b
  where a.col != b.col
  and ROWNUM <= 10;

COL1    COL2     LEV JARO_WINKLER BIGRAM TRIGRAM JWHOLE_WORD LCS 
_______ ________ ___ ____________ ______ _______ ___________ ___ 
Brendan BRENDAN   15           48      0       0           0  14 
Brendan Breandan  88           93     71      50           0  50 
Brendan Brenden   86           94     66      60           0  71 
Brendan Brandon   72           84     50       0           0  28 
Brendan Brendawn  88           97     71      66           0  75 
Brendan Bhenden   72           82     33      20           0  42 
Brendan Brendin   86           94     66      60           0  71 
Brendan Brendon   86           94     66      60           0  71 
Brendan Beenden   72           82     33      20           0  42 
Brendan Breenden  75           90     57      33           0  37 

By default, the output is a percentage similarity, but the UNSCALED keyword can be added to return the raw value.

with names (col) as ( 
      values 
      ('Brendan'), ('BRENDAN'), ('Breandan'), ('Brenden'), ('Brandon'),
  	('Brendawn'), ('Bhenden'), ('Brendin'), ('Brendon'), ('Beenden'),
  	('Breenden'), ('Brendin'), ('Brendyn'), ('Brandon'), ('Brenainn'), 
  	('Bréanainn') )
  select a.col as col1, b.col as col2, 
         fuzzy_match(levenshtein, a.col, b.col, unscaled) as lev,
         fuzzy_match(jaro_winkler, a.col, b.col, unscaled) as jaro_winkler,
         fuzzy_match(bigram, a.col, b.col, unscaled) as bigram,
         fuzzy_match(trigram, a.col, b.col, unscaled) as trigram,
         fuzzy_match(whole_word_match, a.col, b.col, unscaled) as jwhole_word,
         fuzzy_match(longest_common_substring, a.col, b.col, unscaled) as lcs
  from names a, names b
  where a.col != b.col
  and ROWNUM <= 10;

COL1    COL2     LEV JARO_WINKLER BIGRAM TRIGRAM JWHOLE_WORD LCS 
_______ ________ ___ ____________ ______ _______ ___________ ___ 
Brendan BRENDAN    6         0.48      0       0           0   1 
Brendan Breandan   1         0.93      5       3           0   4 
Brendan Brenden    1         0.94      4       3           0   5 
Brendan Brandon    2         0.84      3       0           0   2 
Brendan Brendawn   1         0.97      5       4           0   6 
Brendan Bhenden    2         0.82      2       1           0   3 
Brendan Brendin    1         0.94      4       3           0   5 
Brendan Brendon    1         0.94      4       3           0   5 
Brendan Beenden    2         0.82      2       1           0   3 
Brendan Breenden   2          0.9      4       2           0   3

This was another post in the ‘Back to Basics’ series of posts. Make sure to check out the other posts in the series.