Advanced Ways To Identify Duplicates In Customer Data

If you manage customer data at your company, you have almost certainly run into duplicate records. Duplicates enter the system through manual data entry, imports from other platforms, and customers filling out forms. Whatever the source, the consequences are the same, and they are costly to rectify.

The cost of improving data quality and removing duplicates may surprise you. Companies in the US alone spend around $600 billion to improve their data quality. Duplicate contacts, deals and companies are among the most common errors in today’s data. They are generally found in the CRM database, where they affect customer relationships, marketing and sales campaigns, and support initiatives. All of this adds up to a high cost of rectification.

This article talks about some of the more advanced types of duplicate data that you are bound to find in your CRM database.

Generic Terms That Are Expressed Differently
This is the most common way for duplicate data to go unnoticed in a database: the same common term is expressed differently from one record to the next.

For example, let’s say that you are trying to find duplicates and you are using the company name as the primary way to find them. The same company could be written differently across customer records that are duplicates:


  • Alphabet Incorporated
  • Alphabet Inc.

As you can see, the company name is written in two different ways, which is likely to produce a duplicate record.

Let us take another example, this time with job titles:

  • Chief Operating Officer
  • C.O.O.
  • COO

This is why data standardization is so important. Without it, detecting duplicates is nearly impossible. If you don’t have a standardization process in place, your CRM is almost certainly filled with these kinds of duplicates.
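
As a minimal sketch of what standardization could look like, the Python below maps the company and job title variants shown above to one canonical form before any comparison happens. The lookup tables and function names are hypothetical and kept tiny for illustration; a real project would maintain much larger ones.

```python
import re

# Hypothetical lookup tables; a real project would maintain far larger ones.
COMPANY_SUFFIXES = {"incorporated": "inc", "corporation": "corp", "limited": "ltd"}
JOB_TITLES = {"chief operating officer": "coo"}


def _clean(text: str) -> str:
    """Lowercase, drop periods and commas, and collapse whitespace."""
    text = re.sub(r"[.,]", "", text.lower())
    return " ".join(text.split())


def standardize_company(name: str) -> str:
    """Map each word through the suffix table so variants share one form."""
    return " ".join(COMPANY_SUFFIXES.get(word, word) for word in _clean(name).split())


def standardize_title(title: str) -> str:
    """Reduce a job title to a canonical abbreviation when one is known."""
    cleaned = _clean(title)
    return JOB_TITLES.get(cleaned, cleaned)
```

With these tables, "Alphabet Incorporated" and "Alphabet Inc." both standardize to the same string, as do "Chief Operating Officer", "C.O.O." and "COO", so an exact comparison on the standardized values can catch them.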

Nicknames And Short Names
Most of us are known by more than one name. Some people use a shorter, casual version of their first name, some use initials, and some go by a nickname.

For example, if a person’s name is John Paul Jones, you might see his name expressed in different ways across various duplicate CRM contact records.

  • John Jones
  • Jon Jones
  • Jon Paul Jones
  • Jones Paul John
  • JP Jones
  • J.P. Jones

There could even be cases where nicknames such as Junior or Bud are in common use, so a generic duplicate detection process could fail here.
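
One possible way to soften this problem is to compare a normalized "name key" instead of the raw name. The sketch below is only an illustration: the variant table is hypothetical, the function name is made up for this example, and it assumes names written in plain ASCII letters.

```python
import re

# Hypothetical nickname and spelling-variant table, for illustration only.
NAME_VARIANTS = {"jon": "john", "johnny": "john"}


def name_key(full_name: str) -> str:
    """Build an order-insensitive key from a person's name."""
    # Keep only letters and whitespace, so "J.P." becomes "jp".
    tokens = re.sub(r"[^a-z\s]", "", full_name.lower()).split()
    # Replace known variants with their canonical form.
    canonical = [NAME_VARIANTS.get(token, token) for token in tokens]
    # Sorting makes "Jones Paul John" and "John Paul Jones" produce the same key.
    return " ".join(sorted(canonical))
```

This makes "Jon Paul Jones" and "Jones Paul John" produce the same key, and likewise "JP Jones" and "J.P. Jones". It does not link "John Jones" to "John Paul Jones" or initials to full names; those cases need extra rules or a similarity-based comparison.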

Typos And Misspellings
Fun fact: the average human data entry error rate is around 1%, which means roughly one typo for every hundred keystrokes. You will find typos wherever humans are responsible for entering data. So if you rely on an employee-facing or customer-facing form instead of automated data collection, you can be sure to have duplicate data that went undetected because of typos.

Common data errors show up in company names, like:

  • Microsoft
  • Microsift

Or in personal names, like:

  • Jane
  • Jame

Errors like these are inevitable when data is entered into large customer databases, and they make duplicates difficult to find.

Titles & Suffixes
Titles and suffixes in contact data also cause a lot of duplicate records.

Using the previous example of John Paul Jones, you could have duplicate records such as:

  • Dr. John Jones
  • Dr. Jon Jones
  • Mr. John Paul Jones
  • John Paul Jones Jr.
  • Jon Jones III
  • John Paul Jones Esq.

Titles and suffixes deserve careful consideration when it comes to data quality, as they are one of the major sources of duplicates.
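
A simple way to handle this, sketched below in Python, is to strip a leading title and a trailing suffix before comparing names. The title and suffix lists are assumptions made for this example and would need to be extended to match your own data.

```python
import re

# Hypothetical lists; extend them to match the titles and suffixes in your data.
TITLES = {"dr", "mr", "mrs", "ms", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "esq"}


def strip_title_and_suffix(full_name: str) -> str:
    """Remove one leading title and one trailing suffix, if present."""
    # Drop periods and commas so "Dr." and "Jr.," compare cleanly.
    tokens = re.sub(r"[.,]", "", full_name).split()
    if tokens and tokens[0].lower() in TITLES:
        tokens = tokens[1:]
    if tokens and tokens[-1].lower() in SUFFIXES:
        tokens = tokens[:-1]
    return " ".join(tokens)
```

After this step, "Dr. John Jones" and "John Jones" compare as equal. Keep in mind that a suffix such as Jr. or III can also distinguish two different people, so it may be safer to store the stripped title and suffix in separate fields than to discard them.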

Website URL
Another common way to find duplicate records within a CRM is by website URL.

Two customer records for the same company may or may not include “http://” or “www.” in the website field, which again causes duplicate records. In other instances, different customer records may list different domain extensions for the same company.
Subdomains are another source of duplicates. For example, a university could use a different subdomain for each of its departments.

Every website URL needs to be checked to ensure that your database is clear of such issues.
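
The “http://” and “www.” variations can be removed before comparison. The sketch below uses Python's standard urllib.parse module; the function name is made up for this example, and example.com in the note that follows is only a placeholder domain.

```python
from urllib.parse import urlsplit


def normalize_website(url: str) -> str:
    """Reduce a website field to a bare lowercase host name."""
    url = url.strip().lower()
    # urlsplit only recognizes the host when a scheme separator is present.
    if "://" not in url:
        url = "http://" + url
    host = urlsplit(url).hostname or ""
    # Treat "www.example.com" and "example.com" as the same site.
    if host.startswith("www."):
        host = host[4:]
    return host
```

With this, "http://www.example.com/about", "https://example.com" and "www.example.com" all normalize to "example.com". The sketch deliberately leaves other subdomains and different domain extensions untouched, because deciding whether those belong to the same organization usually takes extra rules or a manual review.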

Matching by Similarity (Fuzzy Matching)
Relying only on “exact match” identification is bound to leave several duplicates drifting around in your CRM. There are simply too many possible variations across fields for exact matching to be effective.

Fuzzy matching is a programmatic technique for analyzing data and identifying customer records that are similar but are not exact matches. It works by measuring the “closeness” of two data points.

Closeness is measured by the number of changes required to make two data points match. This is also known as “edit distance”. Edit distance counts the insertions, deletions and substitutions that are required to make the two data points match exactly.

insertion: bar → barn
deletion: barn → bar
substitution: barn → bark

Without a fuzzy matching technique in place, it would be really difficult to find duplicates in a larger database.
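
To make the idea concrete, here is a minimal Python sketch of the classic Levenshtein edit distance, which counts exactly these three operations. The threshold of one edit in is_fuzzy_match is an assumption chosen for illustration; the right value depends on your data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,        # deletion
                current[j - 1] + 1,     # insertion
                previous[j - 1] + cost  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Treat two values as likely duplicates if they are within max_distance edits."""
    return edit_distance(a.lower(), b.lower()) <= max_distance
```

Under this definition, "Microsoft" and "Microsift" are one substitution apart, as are "Jane" and "Jame", so both pairs would be flagged as likely duplicates. A low threshold can also flag short values that really are different, so the matches are best treated as candidates for review.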

Secondary Check
One major reason duplicate customer records slip through the cracks is that companies identify duplicates using a fixed set of fields, without any secondary check.

For example, you can identify duplicates by first name, last name and phone number. Matching on the combination of these fields captures most duplicates, but it misses records where one of those fields is missing or written differently.

Running a secondary check when the first one fails can help you find and remove the free-floating duplicates that were missed the first time.
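
A two-stage check could be sketched as follows. The primary key uses the first name, last name and phone number mentioned above. The secondary key uses an email field, which is purely an assumption for this example, as are the record layout and the function names.

```python
from itertools import combinations


def primary_key(record: dict) -> tuple:
    """First check: first name, last name and phone number together."""
    return (record["first_name"].lower(), record["last_name"].lower(), record["phone"])


def secondary_key(record: dict) -> str:
    """Fallback check. The email field is assumed; use whichever field suits your data."""
    return record["email"].strip().lower()


def find_duplicates(records: list[dict]) -> list[tuple[int, int, str]]:
    """Return (index_a, index_b, check_name) for every pair that matches."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if primary_key(a) == primary_key(b):
            matches.append((i, j, "primary"))
        # Only fall back when the secondary field is actually filled in.
        elif secondary_key(a) and secondary_key(a) == secondary_key(b):
            matches.append((i, j, "secondary"))
    return matches
```

Comparing every pair of records is fine for a small list but gets slow on a large database, where you would typically group records by key first.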

Phone Numbers in Different Formats
Phone numbers are often used to identify duplicate accounts and contacts in CRMs.

Two duplicate records for the same contact are likely to share the same phone number. Plus, organizations do not change their main line number often, so this can serve as a reliable field for duplicate detection.

But there can be problems with using the phone number as the primary field.

First, the same phone number can be formatted in multiple ways in your database:

  • 123-456-7890
  • 1234567890
  • 123.456.7890
  • (123)-456-7890
  • 123 456 7890
  • 1-123-456-7890

This means that matching on the raw phone number field will leave a lot of hidden duplicates in your database.
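
All of the formats listed above can be reduced to the same digits before comparison. The sketch below assumes US-style ten-digit numbers, where an eleven-digit number starting with 1 carries the country code. Numbers from other countries would need different rules.

```python
import re


def normalize_phone(raw: str) -> str:
    """Keep only the digits and drop a leading US/Canada country code."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: 11 digits starting with 1 means a US/Canada country code.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits
```

Every format in the list above, including "(123)-456-7890" and "1-123-456-7890", normalizes to "1234567890", so an exact match on the normalized value finds them all.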

Partial Matches
Some duplicates are difficult to catch using VLOOKUP and other Excel functions.

For example, suppose one of your accounts is a university. Contacts in different departments should be treated separately from each other, because each department makes its decisions independently.

Partial matching techniques can be used to identify duplicates that share similarities.

For example, partial matching can be used to detect duplicate records for prospects that had their university listed in multiple ways.

  • University of London
  • University of London School of Business
  • London University School of Business
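
One simple form of partial matching compares the words two names share rather than the full strings. The Python sketch below uses the overlap coefficient, the number of shared words divided by the size of the smaller word set. The stop word list and function names are assumptions for this example, and this is only one of several possible techniques.

```python
# Assumed stop words; they carry little signal when comparing organization names.
STOP_WORDS = {"of", "the", "and", "for"}


def tokens(name: str) -> set[str]:
    """Split a name into lowercase words, ignoring stop words."""
    return {word for word in name.lower().split() if word not in STOP_WORDS}


def overlap_score(a: str, b: str) -> float:
    """Overlap coefficient: shared words divided by the smaller word set."""
    set_a, set_b = tokens(a), tokens(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))
```

All three university names listed above score 1.0 against each other, because the shorter name's words always appear in the longer one. Since departments may need to stay separate, a high score is better used to flag records for review than to merge them automatically.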

Dealing with duplicates is only one phase of the journey of managing customer data and improving the results of your marketing and sales efforts. If you are looking for data enrichment services, book a free consultation call with our experts today.
