Monday, May 30, 2011

Jaro-Winkler Fast Fuzzy Linkage Algorithm

If you need to perform fast fuzzy record linkage and don’t like SOUNDEX, then you might want to look into the Jaro-Winkler algorithm. In this article I will explain what this algorithm does, give you the source code for a SQL CLR function, and describe use cases for this algorithm such as fuzzy linkage and probabilistic linkage.

Let’s bring everybody up to speed first.

Consider the following scenario. You have thousands (or millions for that matter) of data records and you either need to weed out duplicate records or link two data sets together which don’t have a common key.

The typical approach for finding duplicates is to perform a cross join with some tricky but deterministic queries, and/or use SOUNDEX, LIKE and other functions provided by SQL Server, to determine links between primary keys for records containing duplicate information. In essence this problem is similar to linking two distinct data sets, with the only difference being that links are established between different primary keys.

This works well for some scenarios: when the data is well known, the size is not large, it is easy to spot false duplicates, and it is a one-time job, meaning you will never need to do it again.

For those of us who link millions of records more than once in a lifetime, there are other ways to consider.

Please welcome the Jaro-Winkler algorithm.

The Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and is mainly used in the area of record linkage (duplicate detection). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed for and best suited to short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
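For reference, the Jaro similarity of two strings s1 and s2 is

sim_j = (1/3) * ( m/|s1| + m/|s2| + (m - t)/m )

where m is the number of matching characters and t is half the number of transposed characters (sim_j is 0 when m is 0). Winkler's refinement rewards a common prefix: sim_w = sim_j + l * p * (1 - sim_j), where l is the length of the common prefix (up to 4 characters) and p is a scaling factor, typically 0.1. Note that the CLR function below computes the Jaro part of the metric; the Winkler prefix adjustment is not applied.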

How is it better than SOUNDEX?

It is not. SOUNDEX provides phonetic matching, which can be useful in some scenarios, while the Jaro-Winkler string distance algorithm estimates a difference or, roughly, the number of character replacements it takes to convert one string into another. From this you can derive an optimization technique, which I will talk about a little later in this article. Another advantage is that if a typo in a word produces a different sound, SOUNDEX will not match it correctly. For example, if you typed “ytpe” instead of “type”, SOUNDEX will not give you a correct match. The codes returned for SOUNDEX(‘ytpe’) and SOUNDEX(‘type’) are the following:

ytpe -> Y310
type -> T100

If I use the Jaro-Winkler StringDistance(‘ytpe’, ’type’) function, I get 0.91(6), i.e. roughly 0.917, which means it is a good match. With 4 matching characters and 1 transposition, the Jaro score works out to (4/4 + 4/4 + (4 − 1)/4) / 3 = 2.75 / 3 ≈ 0.917. Remember, 1 is an exact match, so 0.917 is very close to it. Typically you will have something like this:

SELECT *
FROM myTable1 AS t1
INNER JOIN myTable2 AS t2
    ON dbo.StringDistance(t1.MyField, t2.MyField) > 0.85
    AND ABS(LEN(t1.MyField) - LEN(t2.MyField)) <= 3 -- some other filters, such as string length

Jaro-Winkler In Code

Below is a SQL CLR version of this algorithm, written in C#, which can be deployed and invoked as a function from T-SQL scripts on your SQL Server.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;
public partial class UserDefinedFunctions
{
    /// <summary>
    /// This region contains code related to the Jaro-Winkler string distance algorithm.
    /// </summary>
    #region Jaro Distance

    private const double defaultMismatchScore = 0.0;
    private const double defaultMatchScore = 1.0;

    /// <summary>
    /// Gets the similarity of the two strings using the Jaro distance.
    /// </summary>
    /// <param name="firstWord"></param>
    /// <param name="secondWord"></param>
    /// <returns>a value between 0-1 of the similarity</returns>
    [Microsoft.SqlServer.Server.SqlFunction]
    public static System.Data.SqlTypes.SqlDouble StringDistance(string firstWord, string secondWord)
    {
        if ((firstWord != null) && (secondWord != null))
        {
            if (firstWord == secondWord)
            {
                return (SqlDouble)defaultMatchScore;
            }
            else
            {
                // get half the length of the shorter string rounded up
                // (this is the distance used for acceptable transpositions)
                int halflen = Math.Min(firstWord.Length, secondWord.Length) / 2 + 1;

                // get common characters
                StringBuilder common1 = GetCommonCharacters(firstWord, secondWord, halflen);
                int commonMatches = common1.Length;

                // check for zero characters in common
                if (commonMatches == 0)
                {
                    return (SqlDouble)defaultMismatchScore;
                }

                StringBuilder common2 = GetCommonCharacters(secondWord, firstWord, halflen);

                // check for common strings of the same length, returning 0.0 if they are not
                if (commonMatches != common2.Length)
                {
                    return (SqlDouble)defaultMismatchScore;
                }

                // get the number of transpositions
                int transpositions = 0;
                for (int i = 0; i < commonMatches; i++)
                {
                    if (common1[i] != common2[i])
                    {
                        transpositions++;
                    }
                }

                // calculate the Jaro metric
                transpositions /= 2;
                double tmp1;
                tmp1 = commonMatches / (3.0 * firstWord.Length) + commonMatches / (3.0 * secondWord.Length) +
                       (commonMatches - transpositions) / (3.0 * commonMatches);
                return (SqlDouble)tmp1;
            }
        }
        return (SqlDouble)defaultMismatchScore;
    }
    /// <summary>
    /// Returns a string buffer of characters from string1 that are found within string2, if they are
    /// within a given distance separation from the position in string1.
    /// </summary>
    /// <param name="firstWord">string one</param>
    /// <param name="secondWord">string two</param>
    /// <param name="distanceSep">separation distance</param>
    /// <returns>a string buffer of characters from string1 within string2 if they are within a given
    /// distance separation from the position in string1</returns>
    private static StringBuilder GetCommonCharacters(string firstWord, string secondWord, int distanceSep)
    {
        if ((firstWord != null) && (secondWord != null))
        {
            StringBuilder returnCommons = new StringBuilder(20);
            StringBuilder copy = new StringBuilder(secondWord);
            int firstLen = firstWord.Length;
            int secondLen = secondWord.Length;
            for (int i = 0; i < firstLen; i++)
            {
                char ch = firstWord[i];
                bool foundIt = false;
                for (int j = Math.Max(0, i - distanceSep);
                     !foundIt && j < Math.Min(i + distanceSep, secondLen);
                     j++)
                {
                    if (copy[j] == ch)
                    {
                        foundIt = true;
                        returnCommons.Append(ch);
                        // mark the character as taken so it cannot match twice
                        copy[j] = '#';
                    }
                }
            }
            return returnCommons;
        }
        return null;
    }
    #endregion

    #region String Functions

    [Microsoft.SqlServer.Server.SqlFunction]
    public static System.Data.SqlTypes.SqlInt32 FirstIndexOf(string text, string searchWord)
    {
        if ((searchWord == null) || (searchWord.GetType() == typeof(DBNull)))
        {
            searchWord = "";
        }
        if ((text == null) || (text.GetType() == typeof(DBNull)))
        {
            text = "";
        }
        return (SqlInt32)text.IndexOf(searchWord);
    }

    [Microsoft.SqlServer.Server.SqlFunction]
    public static System.Data.SqlTypes.SqlInt32 FirstIndexOfPattern(string text, string pattern)
    {
        // guard against NULL input the same way the other helpers do
        if (text == null)
        {
            text = "";
        }
        if (pattern == null)
        {
            return -1;
        }
        Regex myRegEx = new Regex(pattern);
        if (!myRegEx.IsMatch(text))
        {
            return -1;
        }
        return myRegEx.Match(text).Index;
    }

    [Microsoft.SqlServer.Server.SqlFunction]
    public static System.Data.SqlTypes.SqlInt32 LastIndexOf(string text, string searchWord)
    {
        if ((searchWord == null) || (searchWord.GetType() == typeof(DBNull)))
        {
            searchWord = "";
        }
        if ((text == null) || (text.GetType() == typeof(DBNull)))
        {
            text = "";
        }
        return (SqlInt32)text.LastIndexOf(searchWord);
    }

    [Microsoft.SqlServer.Server.SqlFunction]
    public static System.Data.SqlTypes.SqlBoolean Contains(string text, string searchWord)
    {
        if ((searchWord == null) || (searchWord.GetType() == typeof(DBNull)))
        {
            searchWord = "";
        }
        if ((text == null) || (text.GetType() == typeof(DBNull)))
        {
            text = "";
        }
        return (SqlBoolean)text.Contains(searchWord);
    }

    #endregion
};

I also provide a few additional string functions for convenience. To deploy these functions you will need to create a SQL CLR project in Visual Studio, paste in the code from this article, specify your target server, and click Deploy.
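If you prefer to deploy manually rather than from Visual Studio, something along these lines should work; the assembly name and DLL path below are placeholders, so adjust them for your own build output:

-- Enable CLR integration (one-time server setting, requires appropriate permissions)
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Register the compiled assembly; name and path are examples only
CREATE ASSEMBLY StringUtils
FROM 'C:\Deploy\StringUtils.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Bind a T-SQL function to the CLR method
CREATE FUNCTION dbo.StringDistance(@firstWord NVARCHAR(4000), @secondWord NVARCHAR(4000))
RETURNS FLOAT
AS EXTERNAL NAME StringUtils.UserDefinedFunctions.StringDistance;
GO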


How to optimize?

If you are working with a somewhat static set such as for spell check, then I recommend doing the following to optimize performance:

  • calculate the length of the strings and save it into one of the columns, then during comparison exclude everything that is substantially different in length from the compared string. So if the compared string has 10 characters, limit your search to strings of 10 ± 3 characters and don’t worry about the rest; most likely the other strings would be a bad match anyway (see the sketch after this list).
  • add indexes on both string columns that participate in the comparison and a sorted index on the string length. In fact, sort the whole table by string length and then by string value.
  • you can also fine-tune how many results you get by adjusting the distance threshold: lower it to get more results or raise it to get fewer. 0.85 seems to work well for me.
  • clean your data.
  • remove insignificant characters. If strings are too long and you have something like MYCOMPANY ABC and MYCOMPANY CDE, you will get false positives if these are in fact two different names. In this case remove the white space and MYCOMPANY. If the two strings represent the same name, then simply remove the white space and limit the comparison by length, using a substring as a percentage of the total length.
  • LEARN YOUR DATA. Fine-tuning ultimately needs to be performed against a specific data set, so knowing exactly what is in your tables is important.
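Here is a minimal sketch of the length pre-filter from the first bullet. The table and column names are placeholders, and MyFieldLen is assumed to be a pre-computed (or persisted computed) column holding LEN(MyField):

SELECT t1.MyField, t2.MyField,
       dbo.StringDistance(t1.MyField, t2.MyField) AS Score
FROM myTable1 AS t1
INNER JOIN myTable2 AS t2
    ON t2.MyFieldLen BETWEEN t1.MyFieldLen - 3 AND t1.MyFieldLen + 3
WHERE dbo.StringDistance(t1.MyField, t2.MyField) > 0.85;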

Typically, in a vocabulary of 100 000 entries such filters reduce the number of comparisons to about 300. Unless, that is, you are doing full-blown record linkage, in which case you will need to apply probabilistic methods and calculate scores.

Also, in MS SQL Server the Jaro-Winkler string distance performs much better when wrapped in a CLR function, since SQL Server doesn't support arrays natively and much of the processing in this algorithm revolves around arrays. An implementation in pure T-SQL adds too much overhead, but SQL CLR works extremely fast.

Applications


I have personally applied this algorithm in a spell-check program against a large medical dictionary. It worked really well (under 0.5 sec) on modest hardware (dual CPU with 2 GB RAM and SCSI drives) with 150 000 dictionary words.

It can also be used for string comparisons in probabilistic record linkage. Probabilistic record linkage methodology calculates probability weights for strings within a data set; weights for common string values are lower than those for uncommon ones. The weights from the different comparisons are added up into a final score. The scores will most likely follow a normal distribution, and there will be positives, false positives, false negatives and negatives. Thresholds are chosen to control false positives and false negatives: you definitely keep the positives and remove the negatives, and the rest go through a somewhat manual review. Some applications have these steps very well automated, for example the SQL Match program developed by Patrick Smith. I know Patrick personally and worked with him for several years; what he claims on his web site is true and works just as described.
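As an illustration only (not how SQL Match or any other product works), per-field similarity scores can be combined into a weighted total and classified by thresholds. The tables, columns, weights and cut-offs below are made-up placeholders:

SELECT a.PersonId AS IdA, b.PersonId AS IdB, s.TotalScore,
       CASE WHEN s.TotalScore >= 0.90 THEN 'match'
            WHEN s.TotalScore >= 0.75 THEN 'manual review'
            ELSE 'non-match' END AS Classification
FROM PersonsA AS a
CROSS JOIN PersonsB AS b
CROSS APPLY (SELECT 0.5 * dbo.StringDistance(a.LastName,  b.LastName)
                  + 0.3 * dbo.StringDistance(a.FirstName, b.FirstName)
                  + 0.2 * dbo.StringDistance(a.City,      b.City) AS TotalScore) AS s;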

The other use is finding duplicates, which is the most common scenario in data cleansing. I have also used it for fuzzy matching and data-cleaning efforts multiple times, providing users with the ability to look at the matches in different ways.

In my next post I will try to apply this algorithm to word suggestions on the Windows Phone 7 mobile platform. I am interested in how well it will perform, given that SQL CE is very limited and some parts will have to be implemented in plain C#; I’ll try to use LINQ, though.

Happy coding!
