Monday, May 30, 2011

Jaro-Winkler Fast Fuzzy Linkage Algorithm

If you need to perform fast fuzzy record linkage and don’t like SOUNDEX, then you might want to look into the Jaro-Winkler algorithm. In this article I will explain what this algorithm does, provide source code for a SQL CLR function, and give examples of use cases for this algorithm, such as fuzzy linkage and probabilistic linkage.

Let’s bring everybody up to speed first.

Consider the following scenario. You have thousands (or millions for that matter) of data records and you either need to weed out duplicate records or link two data sets together which don’t have a common key.

The typical approach for finding duplicates is to perform a cross join with some tricky but deterministic queries, and/or use SOUNDEX, LIKE and other functions provided by SQL Server to determine links between primary keys for records containing duplicate information. In essence this problem is similar to linking two distinct data sets, the only difference being that links are established between different primary keys.

This works well for some scenarios: when the data is well known, the size is not large, it is easy to spot false duplicates, and it is a one-time job, meaning you will never need to do it again.

For those of us who are linking millions of records more than once in a lifetime, there are other ways to consider.

Please welcome the Jaro-Winkler algorithm.

The Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and is mainly used in the area of record linkage (duplicate detection). The higher the Jaro–Winkler distance for two strings, the more similar the strings are. The metric is designed for, and best suited to, short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

How is it better than SOUNDEX?

It is not. SOUNDEX provides phonetic matching, which can be useful in some scenarios, while the Jaro-Winkler string distance algorithm estimates, roughly, the number of character edits it takes to convert one string into another. From this you can derive an optimization technique, which I will cover a little later in this article. Another advantage is that if a typo produces a different sound, SOUNDEX will not match the word correctly. For example, if you typed “ytpe” instead of “type”, SOUNDEX will not give you a correct match. The codes returned for SOUNDEX('ytpe') and SOUNDEX('type') are the following:

ytpe → Y310

type → T100

If I use Jaro-Winkler StringDistance('ytpe', 'type'), the result is 0.91(6), which means it is a good match. Remember, 1 is an exact match, so 0.91 is very close to it. Typically you will have something like this:

SELECT * 
FROM myTable1 as t1 
INNER JOIN myTable2 as t2 
ON dbo.StringDistance(t1.MyField, t2.MyField) > 0.85
AND -- some other filters such as string Length
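If you want to sanity-check the 0.91(6) figure outside of SQL Server, here is a minimal Python sketch (used purely for illustration) of the same matching-window Jaro metric that the CLR function below computes:

```python
def get_common_characters(first, second, dist):
    """Characters of `first` found in `second` within `dist` of their position."""
    copy = list(second)
    commons = []
    for i, ch in enumerate(first):
        for j in range(max(0, i - dist), min(i + dist, len(second))):
            if copy[j] == ch:
                commons.append(ch)
                copy[j] = '#'   # mark as used so it cannot match twice
                break
    return commons

def jaro(first, second):
    if first == second:
        return 1.0
    halflen = min(len(first), len(second)) // 2 + 1
    c1 = get_common_characters(first, second, halflen)
    c2 = get_common_characters(second, first, halflen)
    if not c1 or len(c1) != len(c2):
        return 0.0
    transpositions = sum(a != b for a, b in zip(c1, c2)) // 2
    m = len(c1)
    return (m / len(first) + m / len(second) + (m - transpositions) / m) / 3.0

print(jaro('ytpe', 'type'))  # 0.9166666666666666
```

For 'ytpe' vs 'type' all four characters match and the y/t swap counts as one transposition, giving (1 + 1 + 3/4) / 3 = 0.91(6).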

Jaro-Winkler In Code

Below is a SQL CLR version of this algorithm, written in C# so it can be deployed to your SQL Server and invoked as a function from T-SQL scripts. (Strictly speaking, the function computes the base Jaro metric; the Winkler common-prefix adjustment is not applied.)

using System;
using System.Data;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Text.RegularExpressions;

public partial class UserDefinedFunctions
{
    /// <summary>
    /// This region contains code related to the Jaro-Winkler string distance algorithm.
    /// </summary>
    #region Jaro Distance

    private const double defaultMismatchScore = 0.0;
    private const double defaultMatchScore = 1.0;

    /// <summary>
    /// Gets the similarity of the two strings using the Jaro distance.
    /// </summary>
    /// <param name="firstWord">first string to compare</param>
    /// <param name="secondWord">second string to compare</param>
    /// <returns>a value between 0 and 1 indicating the similarity</returns>
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlDouble StringDistance(string firstWord, string secondWord)
    {
        if ((firstWord != null) && (secondWord != null))
        {
            if (firstWord == secondWord)
            {
                return (SqlDouble)defaultMatchScore;
            }

            // Get half the length of the shorter string, rounded up -
            // this is the window used for acceptable transpositions.
            int halflen = Math.Min(firstWord.Length, secondWord.Length) / 2 + 1;

            // Get common characters.
            StringBuilder common1 = GetCommonCharacters(firstWord, secondWord, halflen);
            int commonMatches = common1.Length;

            // Check for zero characters in common.
            if (commonMatches == 0)
            {
                return (SqlDouble)defaultMismatchScore;
            }

            StringBuilder common2 = GetCommonCharacters(secondWord, firstWord, halflen);

            // The two common-character strings must be the same length.
            if (commonMatches != common2.Length)
            {
                return (SqlDouble)defaultMismatchScore;
            }

            // Get the number of transpositions.
            int transpositions = 0;
            for (int i = 0; i < commonMatches; i++)
            {
                if (common1[i] != common2[i])
                {
                    transpositions++;
                }
            }
            transpositions /= 2;

            // Calculate the Jaro metric.
            double score = commonMatches / (3.0 * firstWord.Length)
                         + commonMatches / (3.0 * secondWord.Length)
                         + (commonMatches - transpositions) / (3.0 * commonMatches);
            return (SqlDouble)score;
        }
        return (SqlDouble)defaultMismatchScore;
    }

    /// <summary>
    /// Returns the characters of firstWord that also occur in secondWord
    /// within a given distance of their position in firstWord.
    /// </summary>
    /// <param name="firstWord">string one</param>
    /// <param name="secondWord">string two</param>
    /// <param name="distanceSep">separation distance</param>
    /// <returns>a string buffer of the common characters</returns>
    private static StringBuilder GetCommonCharacters(string firstWord, string secondWord, int distanceSep)
    {
        if ((firstWord != null) && (secondWord != null))
        {
            StringBuilder returnCommons = new StringBuilder(20);
            StringBuilder copy = new StringBuilder(secondWord);
            int firstLen = firstWord.Length;
            int secondLen = secondWord.Length;
            for (int i = 0; i < firstLen; i++)
            {
                char ch = firstWord[i];
                bool foundIt = false;
                for (int j = Math.Max(0, i - distanceSep);
                     !foundIt && j < Math.Min(i + distanceSep, secondLen);
                     j++)
                {
                    if (copy[j] == ch)
                    {
                        foundIt = true;
                        returnCommons.Append(ch);
                        copy[j] = '#'; // mark as used so it cannot match twice
                    }
                }
            }
            return returnCommons;
        }
        return null;
    }

    #endregion

    #region String Functions

    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlInt32 FirstIndexOf(string text, string searchWord)
    {
        searchWord = searchWord ?? "";
        text = text ?? "";
        return (SqlInt32)text.IndexOf(searchWord);
    }

    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlInt32 FirstIndexOfPattern(string text, string pattern)
    {
        if (text == null || pattern == null)
        {
            return -1;
        }
        Match match = new Regex(pattern).Match(text);
        return match.Success ? (SqlInt32)match.Index : (SqlInt32)(-1);
    }

    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlInt32 LastIndexOf(string text, string searchWord)
    {
        searchWord = searchWord ?? "";
        text = text ?? "";
        return (SqlInt32)text.LastIndexOf(searchWord);
    }

    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlBoolean Contains(string text, string searchWord)
    {
        searchWord = searchWord ?? "";
        text = text ?? "";
        return (SqlBoolean)text.Contains(searchWord);
    }

    #endregion
};

Additional string functions are provided for convenience. To deploy these functions you will need to create a SQL CLR project in Visual Studio and paste in the code from here. Specify your target server and click Deploy.


How to optimize?

If you are working with a somewhat static set, such as for spell check, then I recommend the following to optimize performance:

  • calculate the length of each string and save it in a column; then during comparison exclude everything that is substantially different in length from the compared string. So if the compared string has 10 characters, limit your search to strings of 10 ± 3 characters and don’t worry about the rest. Most likely the other strings would be a bad match anyway.
  • add an index to both string columns that participate in the comparison, and a sorted index on string length. In fact, sort the whole table by string length and then by string value.
  • fine-tune how many results you get by adjusting the distance threshold: lower it to get more results or raise it to get fewer. 0.85 seems to work well for me.
  • clean your data
  • remove insignificant characters. If strings are too long and you have something like MYCOMPANY ABC and MYCOMPANY CDE, you will get false positives if these are in fact two different names. So in this case remove the white space and the MYCOMPANY prefix. If the two strings represent the same name, then simply remove white space and limit the comparison by length, using a substring as a percentage of the total length.
  • LEARN YOUR DATA. Fine-tuning ultimately needs to be performed over a specific data set, so knowing exactly what is in your tables is important.
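The length filter and the sort-by-length advice above can be sketched in a few lines. This is an illustrative Python sketch only; the vocabulary and the max_len_diff value are assumptions, not part of the article:

```python
import bisect

def build_index(vocabulary):
    # Sort once by length, then value - the in-memory analogue of the
    # sorted index on string length suggested above.
    return sorted(vocabulary, key=lambda w: (len(w), w))

def length_candidates(index, query, max_len_diff=3):
    # Binary-search the sorted list to keep only strings whose length is
    # within len(query) +/- max_len_diff; everything else is skipped
    # before the expensive distance function ever runs.
    lengths = [len(w) for w in index]
    lo = bisect.bisect_left(lengths, len(query) - max_len_diff)
    hi = bisect.bisect_right(lengths, len(query) + max_len_diff)
    return index[lo:hi]

index = build_index(["type", "typewriter", "tap", "typography", "tip"])
print(length_candidates(index, "ytpe"))  # ['tap', 'tip', 'type']
```

Because the candidate window is a contiguous slice of the sorted list, the distance function only ever sees strings that could plausibly match.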

Typically, in a vocabulary of 100,000 entries such filters reduce the number of comparisons to about 300, unless you are doing full-blown record linkage, in which case you will need to apply probabilistic methods and calculate scores.

Also, in MS SQL Server the Jaro-Winkler string distance performs much better wrapped into a CLR function, since SQL Server doesn't support arrays natively and much of the processing in this algorithm revolves around arrays. An implementation in T-SQL adds too much overhead, but SQL CLR works extremely fast.

Applications


I have personally applied this algorithm in a spell-check program against a large medical dictionary of 150,000 words. It worked really well (under 0.5 sec) on modest hardware (dual CPU with 2 GB RAM and SCSI drives).

It can also be used for string comparisons in probabilistic record linkage. Probabilistic record linkage methodology calculates probability weights for strings within a data set; weights for common string values are lower than for uncommon ones. The weights from the different comparisons are added into a final score. The scores will most likely follow a normal distribution, containing positives, false positives, false negatives and negatives. Thresholds are determined for false positives and false negatives: you definitely keep the positives and remove the negatives, and for the rest you apply a somewhat manual review. Some apps automate these steps very well, for example the SQL Match program developed by Patrick Smith. I know Patrick personally and worked with him for several years; what he claims on his web site is true and works just as described.
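The scoring-and-threshold workflow just described can be sketched as follows. The field names, weights and thresholds here are hypothetical, chosen only to illustrate the idea:

```python
def classify(total_score, lower=0.8, upper=2.5):
    # Two thresholds split the score distribution into definite matches,
    # definite non-matches, and a band left for manual review.
    if total_score >= upper:
        return "match"
    if total_score <= lower:
        return "non-match"
    return "manual review"

def link_score(rec_a, rec_b, fields, similarity, weights):
    # Sum weighted per-field similarities into a final score;
    # rarer field values would get higher weights in a real system.
    return sum(weights[f] * similarity(rec_a[f], rec_b[f]) for f in fields)

# Toy example with exact-match similarity and made-up weights.
sim = lambda a, b: 1.0 if a == b else 0.0
weights = {"last_name": 1.5, "first_name": 1.0, "zip": 0.5}
a = {"last_name": "Smith", "first_name": "Pat", "zip": "30301"}
b = {"last_name": "Smith", "first_name": "Pat", "zip": "30302"}
score = link_score(a, b, weights.keys(), sim, weights)
print(score, classify(score))  # 2.5 match
```

In practice the similarity function would be a string distance such as the one above, and the weights would come from the data itself.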

The other use is finding duplicates, which is the most common scenario for data cleansing. I have used it for fuzzy matching and data cleaning efforts multiple times, giving users the ability to look at the matches in different ways.

In my next post I will try to apply this algorithm to word suggestions on the Windows Phone 7 mobile platform. I am interested in how well it will perform, given that SQL CE is very limited and some parts will have to be implemented in plain C#; I’ll try to use LINQ though.

Happy coding!

Monday, May 9, 2011

Goodbye Inverse Control Flow, Hello Async!!!

In this post I will talk about the new Async features in the .NET Framework (still in CTP, with the SP1 Refresh, as of May 2011). These features add new syntax, and the compiler emits implicit control structures to seamlessly convert sequential code into asynchronous code. Let us see in detail what problems we currently have with asynchronous method invocations, handling control flow and exceptions, and how the new features can help us (programmers, developers, coders, etc.) manage our asynchronous code in a much more elegant way.

In essence, the new Async features reduce the complexity of asynchronous code by introducing new syntax. This syntax is simple enough and makes asynchronous code look almost like sequential code. Under the hood, the compiler converts these new structures into implicit closures and continuations built on the Task and Task&lt;T&gt; classes. The caller receives a result right away in the form of a Task or Task&lt;T&gt; instance, which holds the continuation reference, or void (in the case of “fire and forget”). The caller is able to monitor task progress as well as get results once the task is completed. Once the asynchronous work completes, the continuation delegate is called, the final result of type T is populated (in the case of Task&lt;T&gt;), and the task’s completion event is raised, which allows callers to react appropriately. Errors are raised within the context of the async method and handled just as errors are handled in sequential code.

Most of this is compiler generated, and there is no longer a need to wire up events, closures, continuations and error handling manually. What in the world is this, and will I ever have a need for it?

When will it make sense?

It depends on how much parallelism your application needs. With the broad adoption of multi-core and, recently, many-core processors, people expect applications to be faster and more responsive; but until applications know how to run code asynchronously, a single application will not gain much from a multi-core processor. Besides, many blocked threads are computationally expensive.

For example, if there is a need to request N (where N > 1) downloads from remote resources, with network delays of delta_t1..delta_tn in accessing each resource, then sequentially written code will have to wait at least Sum(delta_ti) (for all i = 1..n), plus the actual download time. In the asynchronous case the delay will be Max(delta_ti) (where i = 1..n) + overhead_time. The overhead_time is the time it takes to manage the different threads and effectively execute them in parallel; it depends on the availability of resources (multiple intensive IO threads over one device just add context-switching overhead) and grows roughly proportionally to the number of threads. Therefore it makes sense to use asynchrony when Sum(delta_ti) > Max(delta_ti) + overhead_time.
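The Sum-versus-Max argument is easy to demonstrate. In the sketch below, Python’s asyncio stands in for .NET tasks and sleep calls model network delay; the total wait ends up close to Max(delta_ti) rather than Sum(delta_ti):

```python
import asyncio
import time

async def download(i, delay):
    # Stand-in for a remote download; the sleep models network latency.
    await asyncio.sleep(delay)
    return i

async def main():
    delays = [0.1, 0.2, 0.3]
    start = time.perf_counter()
    # All three "downloads" run concurrently on the event loop.
    results = await asyncio.gather(*(download(i, d) for i, d in enumerate(delays)))
    elapsed = time.perf_counter() - start
    return results, elapsed, delays

results, elapsed, delays = asyncio.run(main())
print(f"elapsed ~ {elapsed:.2f}s, max = {max(delays)}s, sum = {sum(delays)}s")
```

Run sequentially, the same work would take at least the sum of the delays (0.6 s); concurrently it takes roughly the longest single delay plus a small overhead.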

Processes such as web requests, data requests from a server and intensive computations are all good candidates for running in parallel, or at least off of the main or UI thread. For IO, the benefit comes when the operations run against independent resources: as I mentioned, running multiple intensive IO operations over a single shared resource increases overhead rather rapidly, but with independent resources, running IO on different threads will actually improve performance.

So the rule of thumb: can the infrastructure effectively support parallelism?
- If yes, then processing requests over such infrastructure asynchronously will be beneficial.
- If no, then there is a chance the overhead of context switching between threads will outweigh the gain.

Current State of Asynchrony

Currently the .NET 4.0 Framework implements Task and Task&lt;T&gt;, and MSDN has great examples and design patterns for asynchronous method calls using these classes. I mostly use .NET 3.5 and thus have to do more of the plumbing myself.

Let’s look at how we can execute an asynchronous task without Task and the Async features, and the problems we run into while doing it. Here is the main idea, in pseudocode:

some method

    // do some prelim work

    // some kind of way to spin off tasks in the form of delegates (e.g. function pointers)
    BackgroundWorker worker = new BackgroundWorker();

    // need to run a long process, but want to free up the UI
    Action myDelegate = delegate
    {
        do.this();
        do.that();
        do_for_very_long.this();
    };

    EventHandler workerCompleted = (s, ea) =>
    {
        // unhook event handlers
        worker.Completed -= workerCompleted;
        worker.Cancelled -= workerCancelled;
        …
        // handle results
    };

    EventHandler workerCancelled = (s, ea) =>
    {
        // unhook event handlers
        worker.Completed -= workerCompleted;
        worker.Cancelled -= workerCancelled;
        …
        // handle errors
    };

    worker.Completed += workerCompleted;
    worker.Cancelled += workerCancelled;
    worker.RunAsync(myDelegate);

end of some method

What you see above involves inverse flow of control, closures, lambda functions and continuations. But this is typically what happens when there is a need to execute an asynchronous process.
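The same callback-style wiring exists in most ecosystems. As an illustration, here is the pattern in Python; the long_running function is a hypothetical stand-in for the long work in the pseudocode above:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

results = {}
finished = threading.Event()

def long_running():
    # Stand-in for do.this(); do.that(); do_for_very_long.this();
    return "success"

def on_completed(future):
    # The continuation: invoked after the worker finishes or fails.
    if future.exception() is None:
        results["value"] = future.result()
    else:
        results["error"] = future.exception()
    finished.set()

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(long_running)   # spin off and return immediately
future.add_done_callback(on_completed)   # hook up the handler
finished.wait(timeout=5)
executor.shutdown()
print(results)  # {'value': 'success'}
```

Notice the inversion: the submitting code returns right away, and the result is handled wherever the callback happens to run.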

1. We create local variables on the main thread, which become part of the continuation state for the spun-off threads.

Here is a definition from Wikipedia:

“In computer science and programming, a continuation is an abstract representation of the control state. A continuation reifies an instance of a computational process at a given point in the process's execution. It contains information such as the process's current stack (including all data whose lifetime is within the process e.g. "local variables"), as well as the process's point in the computation. Such an instance can then be later resumed upon invocation. The "current continuation" or "continuation of the computation step" is the continuation that, from the perspective of running code, would be derived from the current point in a program's execution.

The term continuations can also be used to refer to first-class continuations, which are constructs that give a programming language the ability to save the execution state at any point and return to that point at a later point in the program.”

In other words, Async does continuations completely behind the scenes, while here we have to wire up events to listen for a callback when the async method completes.

2. We define the execution steps for the asynchronous work, but do not run them yet. This may or may not use a closure. A closure is a way of passing continuation state: if we reference, in this function, variables which were previously defined on our main thread, the compiler will automatically preserve the references for the time when the function executes, subject to GC rules. Most commonly closures are used with lambda functions or anonymous delegates, although these are distinct concepts.

The other way is to explicitly create a named function with arguments and pass all data via those arguments. Sometimes this can be a daunting task.

Here is a definition for closures:

“In computer science, a closure is a function together with a referencing environment for the nonlocal names (free variables) of that function. Such a function is said to be "closed over" its free variables. The referencing environment binds the nonlocal names to the corresponding variables in scope at the time the closure is created, additionally extending their lifetime to at least as long as the lifetime of the closure itself.

Closures are used to implement continuation passing style, and in this manner, hide state. Constructs such as objects and control structures can thus be implemented with closures. In some languages, a closure may occur when a function is defined within another function, and the inner function refers to local variables of the outer function. At runtime, when the outer function executes, a closure is formed, consisting of the inner function’s code and references to any variables of the outer function required by the closure; such variables are called the upvalues of the closure.

The term closure is often mistakenly used to mean anonymous function. This is probably because most languages implementing anonymous functions allow them to form closures and programmers are usually introduced to both concepts at the same time. These are, however, distinct concepts. A closure retains a reference to the environment at the time it was created (for example, to the current value of a local variable in the enclosing scope) while a generic anonymous function need not do this.”

and lambda functions:

“In computing, an anonymous function (also function constant or function literal) is a function (or a subroutine) defined, and possibly called, without being bound to an identifier. Anonymous functions are convenient to pass as an argument to a higher-order function and are ubiquitous in languages with first-class functions.

Anonymous functions originate in the work of Alonzo Church in his invention of the lambda calculus in 1936 (prior to electronic computers), in which all functions are anonymous. The Y combinator can be utilised in these circumstances to provide anonymous recursion, which Church used to show that some mathematical questions are unsolvable by computation. (Note: this result was disputed at the time, and later Alan Turing - who became Church's student - provided a proof that was more generally accepted.)

Anonymous functions have been a feature of programming languages since Lisp in 1958. An increasing number of modern programming languages support anonymous functions, and some notable mainstream languages have recently added support for them, the most widespread being JavaScript; C# and PHP also support anonymous functions. Anonymous functions were added to the C++ language as of C++0x.”
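Both concepts are easy to see in a few lines. Here is an illustrative Python example: a closure formed by an inner function capturing a free variable, and an anonymous (lambda) function that closes over nothing at all:

```python
def make_counter():
    count = 0               # free variable, captured by the closure below
    def increment():
        nonlocal count      # binds to the enclosing function's variable
        count += 1
        return count
    return increment

counter = make_counter()
print(counter(), counter())   # 1 2

# An anonymous function need not form a closure:
square = lambda x: x * x
print(square(5))              # 25
```

The counter keeps state between calls only because the closure extends the lifetime of count beyond make_counter’s return, which is exactly the distinction the quoted definition draws.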

3. We define two event handlers to handle successful completion and exceptional completion.

These are defined using lambda functions, and they do use closures to unhook the event handlers after an event is triggered.

Still, no execution has occurred yet, and therefore up to this point everything runs blazingly fast.

4. We register for events by hooking up our completion and cancellation handlers.

5. We spin off the asynchronous thread and immediately return. This is where execution actually starts, but the function call has already completed. Control flow now goes through the function(s) we defined earlier in our method and, once completed, will call either the Completed handler or, if there are errors, the Cancelled handler. This style of execution is called inversion of control flow.

“In computer programming, Inversion of control (IoC) is an abstract principle describing an aspect of some software architecture designs in which the flow of control of a system is inverted in comparison to procedural programming.

In traditional programming the flow of the business logic is controlled by a central piece of code, which calls reusable subroutines that perform specific functions. Using Inversion of Control this "central control" design principle is abandoned. The caller's code deals with the program's execution order, but the business knowledge is encapsulated by the called subroutines.”

The problem with “Inversion of Control” is that it is not linear and adds a lot of noise (plumbing) to the business logic. It is very easy to introduce bugs using this style of programming.

What a mess, huh? So how is it different with Async?

If you read the “Async Whitepaper” you will notice a somewhat similar example. But let us walk through our example and see how we can benefit from Async.

First we need to change our method declaration to use the async keyword and a return type of Task&lt;T&gt;, assuming we plan to return a result of type T.

private async Task&lt;string&gt; MyFunctionAsync()
{
    string result = String.Empty;

    // must be Func&lt;string&gt;, not Action&lt;string&gt;, since the delegate returns a value
    Func&lt;string&gt; myDelegate = delegate
    {
        do.this();
        do.that();
        do_for_very_long.this();
        return "success";
    };

    try
    {
        result = await Task.Run(myDelegate); // notice the AWAIT keyword. This is new syntax!!!

        // the rest of the code will be wrapped into a continuation,
        // and called only when the above has finished executing.
        // Our MyFunctionAsync will return immediately with a Task&lt;string&gt;,
        // but the string will be empty until this part is finished.

        // handle success

        …
    }
    catch
    {
        // handle failure
        result = "failure";
    }
    return result;
}

This code looks a lot simpler and is less error prone, while it executes almost exactly the same as the version we described earlier!
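For comparison, the same await-plus-try/except shape exists in Python’s async/await. In this sketch the work function is a stand-in, and asyncio.to_thread requires Python 3.9 or later:

```python
import asyncio

def do_for_very_long():
    # Stand-in for the long-running work in the C# example.
    return "success"

async def my_function_async():
    try:
        # await suspends here; the rest of the function is the continuation.
        result = await asyncio.to_thread(do_for_very_long)
    except Exception:
        result = "failure"
    return result

print(asyncio.run(my_function_async()))  # success
```

As in the C# version, error handling is an ordinary try/except around the awaited call rather than a separately wired cancellation handler.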

In conclusion: we looked at the mechanisms used to implement the new Async functionality in the C# Async CTP libraries, and talked about when it is appropriate to use the asynchronous programming model, how asynchrony is currently done, and the problems associated with the current asynchronous programming style. At the end we took a look at the new features of the Async library (it will become part of the C# 5.0 compiler later on) and its much simpler programming model. We pointed to the original whitepaper for the Async library in hopes that readers will refer to it as well. In the process we discussed computer science concepts such as closures, lambda functions, continuations and inversion of control.

I hope some of the readers will find this information useful.

Happy coding!