Today's guest blogger is Chris LaBadie, a Senior Database Developer at Sonoma Partners
Whenever we work on a data project for a customer the subject of cleaning their client data is always a discussion point. The project could involve a migration from one CRM platform to another, integrating an ERP system into CRM, or even cleaning their data “in place”- but the message is usually the same, “we know we have duplicates in our data, help us clean it up!”
Anybody that has ever tried to track people or companies knows that it can be a huge challenge to avoid duplicate data. When you have multiple users maintaining data, it is very common to introduce duplicate data no matter the de-duplication safeguards your system uses.
Working with people-
• Names can be difficult to track (misspellings, maiden names, nicknames, etc).
• People move/change contact information.
• Even unique fields like address or email address can be shared amongst more than one person.
Working with companies-
• Abbreviations or acronyms in names can present a challenge.
• Companies can have multiple locations.
• Often use different addresses to track billing, shipping, etc.
In our experience, there isn’t a magic bullet to eliminate de-duplication. The best solution is usually a layered approach- use form validation to ensure quality data entry, intelligent system design to store records in an organized manner, and system de-duplication rules to search out potential duplicate data and present to a user records that might match the information they are attempting to enter.
However, a common project for Sonoma Partners is moving a client to a new CRM platform and an important part of migrating data to the new platform is to identify potential duplicate data before it ever reaches the new system. While this may sound like a large effort, it is actually pretty easy thanks to the Fuzzy Grouping functionality built into SQL Server Integration Services (SSIS).
Fuzzy Grouping allows SSIS to inspect a set of data and compare one or more fields in the dataset. Rather than comparing the field data, Fuzzy Grouping will match strings based on their sounds- giving more accurate results based on how a person would hear the string while overcoming misspellings, typos, abbreviations, nicknames, etc. Note, you will need SQL Server Enterprise or SQL Server Developer edition to use Fuzzy Grouping. This example is developed using the Business Intelligence Design Studio (BIDS) in SQL Server 2012.
In this demonstration we will process an Excel file of contacts. This file can be generated from any other system and used in a process like this to identify potential duplicate data before that data is migrated to CRM. The end result of this process is to produce a file containing potential duplicates so they can be reviewed and cleaned up in the source system before data migration.
Fuzzy Grouping evaluates the file of Contacts and compares them based on selected fields. Potentially duplicate records are grouped together and assigned a group number.
After Fuzzy Grouping, the process splits so it can sort the results based on their Fuzzy Grouping group number and count the number of records per group.
Finally, a step to check the group count determines which records are potential dupes- a group count of 1 means the record is unique, anything more than 1 means that group contains 1 or more records that should be reviewed. The potential duplicates are then exported to an Excel file for review.
Configuring Fuzzy Grouping is a pretty straight-forward process, just select the fields you want to compare and set the minimum amount of similarity (roughly a percentage of matching). This process can involve a little bit of trial and error while you fine-tune the Fuzzy Grouping to identify the records that are potential duplicates without letting through any false positives.
Typically we will start with lower minimums and go up until we are seeing the desired results.
When you run the process, you can see how many records it processes and how many records SSIS ultimately decided to export to Excel for review.
When the process is complete, you can view the results in your Excel file. Here you can see the unique record number assigned to the record (KeyIn), group number (KeyOut), overall score (percentage of match to the potential duplicate record), similarities for First Name/Last Name/City (percentage of match for each column), and the Group Count (number of potential duplicates per record group). The highlighted values in the screenshot show some of the values that were compared and demonstrate how Fuzzy Grouping can identify potential duplicates despite common misspellings, nicknames, and partial matches.
In conclusion, Fuzzy Grouping is an easy to use and powerful tool to assist in any data cleanup effort. It is simple to setup, and quickly evaluates large amounts of data. SSIS can provide you with all of the tools to make informed decisions regarding your customer data, your most valuable asset.