Data Cleansing Demystified


Corporate transactions such as mergers and acquisitions recognise databases of customers and enquirers as part of the ‘Goodwill’. Their value resides in the future business that can be won by marketing to the people they contain. But the value of the ‘data’ is not fixed.

The ‘quality’ of the data

As time passes, the ‘facts’ change. Business details change because organisations move or go bust. Contacts within organisations change because of promotion and retirement, for example. While the ‘data’ remains static, the ‘facts’ keep changing, and the resulting inaccuracies clutter the minds of salespeople and distract from sales operations.

As the distance between ‘data’ and ‘facts’ grows, marketing effectiveness drops. This means that without other action, the reward from using the data decreases over time.

As datasets are developed by any organisation, duplicates tend to creep in. Historically duplication has been little more than a costly and slightly embarrassing feature of data. GDPR changes the complexion of duplication into something rather more important.

Additionally, what is required of the data may evolve, not least because of legal changes. We now need to be able to show our working, for example by evidencing consent where it is required.

The ‘risks’ in the data

The latest changes to our data protection laws, including GDPR, have introduced additional considerations. There are now significant risks in using inaccurate data, especially as the law states ‘data shall be accurate and, where necessary, kept up to date’. If unaddressed, the passage of time magnifies the distance between ‘data’ and ‘facts’, so the risks involved in using the data increase over time.

Duplicated records also now pose a risk. GDPR says data subjects have an absolute right to object to direct marketing. All duplicates need to be marked as objectors to avoid potential risks. Clearly, most people will understand your duplication issues if you explain, but some may complain, which may lead to the ICO taking an interest in you.
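One way to apply an objection across duplicates is to flag every record that shares the objector's email address. The sketch below assumes records are held as plain dictionaries; the field names `email` and `objects_to_marketing` are hypothetical and chosen only for illustration.

```python
def mark_objection(records, objector_email):
    """Flag every record sharing the objector's email; return how many were flagged."""
    key = objector_email.strip().lower()
    flagged = 0
    for record in records:
        email = (record.get("email") or "").strip().lower()
        if key and email == key:
            record["objects_to_marketing"] = True
            flagged += 1
    return flagged


# Illustrative records: ids 1 and 2 are the same person entered twice.
records = [
    {"id": 1, "email": "jo.bloggs@example.com"},
    {"id": 2, "email": "Jo.Bloggs@Example.com "},
    {"id": 3, "email": "sam.smith@example.com"},
]
flagged = mark_objection(records, "jo.bloggs@example.com")
```

Matching on email alone will miss duplicates that hold a different address for the same person, so in practice the same flag would need to follow whatever duplicate grouping you trust.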

Many people wonder about the real scale of the problem.

The size of the challenge

Government figures consistently show that approximately 1 in every 8 companies dissolves each year, and many more relocate, though no one knows exactly how many.

There is no Government information about changes to decision maker details, but at Corpdata we have been updating our data by telephone for well over 25 years. From our research we know that on average organisational decision makers change every 20 months.

In broad terms this means the contact details for one in five organisations and well over half of decision makers change each year. These figures may seem large, but they are echoed by the results of our telephone research.
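The arithmetic behind the 'well over half' figure can be sketched as follows. This assumes a simple linear model, used here purely for illustration, in which a contact who changes once every 20 months has a 12/20 chance of changing in any given year.

```python
def annual_change_rate(months_between_changes):
    """Fraction of records expected to change in 12 months under a simple
    linear model, capped at 1.0 (an assumption made for illustration)."""
    if months_between_changes <= 0:
        raise ValueError("months_between_changes must be positive")
    return min(1.0, 12.0 / months_between_changes)


decision_maker_rate = annual_change_rate(20)  # 12 / 20 = 0.6
dissolution_rate = 1 / 8                      # 1 in 8 companies = 0.125
```

On these figures about 60% of decision maker records and 12.5% of company records would be affected each year, before relocations are counted.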

If you want to check your data set, do a simple test. Take a random sample of the data and contact the people in it to see who is still there. You will then have a much better understanding of the degree to which your information has decayed. The information you use to generate your new business is usefully considered as a risk versus reward matter. Unless you act, the passage of time causes risks to rise and rewards to fall.
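The sample test can be sketched in a few lines. The functions below are hypothetical helpers, not part of any tool mentioned here: one draws a random sample, the other turns the results of your checks into an estimated decay rate with a rough 95% interval (normal approximation, which is only a guide for small samples).

```python
import math
import random


def draw_sample(records, size, seed=None):
    """Pick `size` records at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, min(size, len(records)))


def estimate_decay(out_of_date_flags):
    """Return (proportion, lower, upper) using a normal-approximation 95% interval."""
    n = len(out_of_date_flags)
    if n == 0:
        raise ValueError("no records were checked")
    p = sum(out_of_date_flags) / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Record ids 1..5000 stand in for a real database.
sample = draw_sample(list(range(1, 5001)), 100, seed=42)

# Suppose phoning the sample finds 30 of the 100 records are out of date.
proportion, lower, upper = estimate_decay([True] * 30 + [False] * 70)
```

With 100 records checked and 30 found wrong, the estimate is 30% give or take about nine percentage points. A larger sample narrows the range.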

But compiling a list of customers and enquirers is very expensive and takes a long time, so it makes business sense to take action to preserve its value. This is where maintenance and data cleansing may help.

Some numbers

  • 925,783 telephone numbers were added to the TPS or CTPS list in 2018
  • Over 93,000 people of working age died in the UK in 2018
  • 672,890 companies were formed and 508,865 were dissolved in 2018-2019
  • Over 250,000 postcodes are 'terminated' each month
  • Over 10,000 businesses change their decision makers or address each month

What is data cleansing and how to do it

Data cleansing is the process of minimising the distance between ‘data’ and ‘facts’.

Sometimes people can be encouraged to keep their data up to date, perhaps through an online tool, but this is quite unusual in a B2B context. Maintaining the data in-house is quite easy if you are in frequent contact with people and you have good procedures in place. Without robust procedures, updating the data is likely to be haphazard, meaning you still cannot rely upon its accuracy. Many companies find this a laborious task, and one that is often hard to prioritise.

Basic duplication can be identified by sorting your data by various items, such as telephone number, email address, postcode and company name. This will help you see groups of similar records and reduce the more obvious challenges.
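If your data is in a spreadsheet, sorting does this job. If it is in a program, grouping by a key gives the same result. The sketch below assumes records are dictionaries; the field names, company names and postcodes are invented for illustration.

```python
from collections import defaultdict


def group_possible_duplicates(records, field):
    """Group records sharing the same (trimmed, lower-cased) value of `field`."""
    groups = defaultdict(list)
    for record in records:
        key = str(record.get(field) or "").strip().lower()
        if key:  # ignore blanks, which would otherwise form one huge group
            groups[key].append(record)
    # Only groups with more than one record are candidates for review.
    return {key: group for key, group in groups.items() if len(group) > 1}


records = [
    {"id": 1, "company": "Acme Ltd", "postcode": "TQ14 8AA"},
    {"id": 2, "company": "ACME LTD ", "postcode": "TQ14 8AA"},
    {"id": 3, "company": "Bolt & Co", "postcode": "EX1 1AA"},
    {"id": 4, "company": "Crane Plc", "postcode": ""},
]
by_company = group_possible_duplicates(records, "company")
by_postcode = group_possible_duplicates(records, "postcode")
```

Each group is only a list of candidates. Two different businesses can legitimately share a postcode or a switchboard number, so review the groups before removing anything.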

Data standardisation sounds very dry, and a matter exclusively for larger enterprises, but your operations can be greatly simplified by having standard data formats. For example, it doesn't matter how you record phone numbers, but you should do the same every time. 01626777400 is the same as (01626) 777400 or +44 (1626) 777 400, or many other layouts, but managing your data, especially identifying duplication, is much more difficult where no standards are applied.
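As a minimal sketch of one possible standard, the function below reduces UK numbers to national format with digits only. The choice of format is an assumption (any consistent format would do), and the function deliberately ignores non-UK numbers and extensions.

```python
import re


def normalise_uk_phone(raw):
    """Reduce a UK phone number to national format, digits only.

    Assumes the number is a UK number; other countries are not handled.
    """
    digits = re.sub(r"\D", "", raw)  # drop spaces, brackets, '+', etc.
    if not digits:
        return ""
    if digits.startswith("0044"):
        digits = digits[4:]
    elif digits.startswith("44"):
        digits = digits[2:]
    if not digits.startswith("0"):
        digits = "0" + digits  # restore the national trunk prefix
    return digits


layouts = ["01626777400", "(01626) 777400", "+44 (1626) 777 400"]
normalised = [normalise_uk_phone(layout) for layout in layouts]
```

All three layouts from the paragraph above come out as `01626777400`, so records holding them can be compared directly.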

You may choose to standardise existing data, but beware of damaging the data quality. For example, replacing every occurrence of "St." with "Street" will leave you with some odd data, such as "Street Michael's Street".
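The pitfall is easy to reproduce. The sketch below contrasts the blanket replacement with a more cautious rule that expands "St" only when it is the last word of the line. The cautious rule is a heuristic chosen for illustration and will not suit every address.

```python
import re

address = "St. Michael's St."

# Blanket approach: replaces every occurrence, including the saint's name.
naive = address.replace("St.", "Street")


def expand_trailing_st(address_line):
    """Expand 'St' or 'St.' only when it is the last word of the line."""
    return re.sub(r"\bSt\.?$", "Street", address_line.strip())


safer = expand_trailing_st(address)
```

Here `naive` is "Street Michael's Street" while `safer` is "St. Michael's Street". Whatever rule you choose, run it on a copy of the data first and review a sample of the changes.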

There are ways others can help you cleanse your business database. It is quite straightforward to employ someone to call people and check their details, though this may prove to be expensive. Another approach is to match and update your database to one that you know to be accurate and up to date.

The ‘match and update’ process can be a quick and cost effective solution. If you have a high value portion of the data that you need to get spot on, a hybrid featuring some telephone checking is often a sensible approach when matching doesn't help.

Bear in mind though, cost should only be one of the factors in selecting your supplier.

Choosing a data cleansing supplier

Here are four important considerations to help you choose a suitable third party for data cleansing (under GDPR, they are classed as a ‘data processor’):

  1. Supplier trustworthiness
     You will be passing this very precious resource to a third party. Ask yourself how valuable it would be to your greatest competitor, or how damaging it would be for data subjects if their data were released. You need to ensure your data will be secure and not passed to anyone else, or used for purposes other than cleaning. Our ‘Due Diligence Questions to Ask Data Suppliers’ are an excellent place to start.

  2. GDPR compliance
     If your database contains personal data, it is important to conduct suitable due diligence on your prospective data cleansing partner. The key issues are their undertakings as a data processor to you as data controller. This is covered by Article 28 of GDPR. Once again our ‘Due Diligence Questions to Ask Data Suppliers’ are a great start.

  3. Quality of reference dataset
     Many suppliers boast massive numbers of records in their data set, and that they are always up to date. These two things are almost impossible to achieve together. If you match your data to something which is only as up to date as your data (or, heaven forbid, older), your data will not be improved.

  4. Quality of matching
     Matching is an ‘artistic science’. It is far from simple to create a good matching tool, and a great tool is one which is tuned to return (almost) no ‘false positives’ (see ‘About Corpdata Matching Systems’ below). Suppliers claiming high match rates typically have a significant proportion of false positives in the results.


False Positives

A false positive is where a ‘match’ is reported, but it isn’t really a match. False positives often result from overly simplistic matching, such as matching on telephone number or company name alone. This can lead to every record with the same number being recorded as the same entity. Sometimes this is correct, but often a centralised 0800 number masks a large, geographically diverse organisation.

This is important because the purpose of matching is to enable the update of your data with other information ‘where the two data elements match’. If you change every ‘matching record’ to be the same you seriously risk damaging the value of your data.

Whilst a few false positives in a prospecting list may be forgivable, ‘updating’ your customer data based on a bad match could cost a fortune.
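The 0800 example can be made concrete. In the sketch below (all names, numbers and postcodes are invented), matching on telephone number alone reports three 'matches' for one supplied record, while a stricter rule reports a match only when exactly one reference record agrees on every chosen field.

```python
def match_on_phone(supplied, reference):
    """Over-simple rule: any reference record with the same phone is a 'match'."""
    return [ref for ref in reference if ref["phone"] == supplied["phone"]]


def match_without_ambiguity(supplied, reference, fields=("phone", "postcode")):
    """Report a match only when exactly one reference record agrees on every field."""
    candidates = [
        ref for ref in reference
        if all(ref[field] == supplied[field] for field in fields)
    ]
    return candidates[0] if len(candidates) == 1 else None


# Three branches of one organisation sharing a central freephone number.
reference = [
    {"id": "R1", "name": "Bigco Exeter", "phone": "08000000000", "postcode": "EX1 1AA"},
    {"id": "R2", "name": "Bigco Plymouth", "phone": "08000000000", "postcode": "PL1 1AA"},
    {"id": "R3", "name": "Bigco Torquay", "phone": "08000000000", "postcode": "TQ1 1AA"},
]
supplied = {"name": "Bigco (Plymouth branch)", "phone": "08000000000", "postcode": "PL1 1AA"}

loose = match_on_phone(supplied, reference)            # all three branches
strict = match_without_ambiguity(supplied, reference)  # only the Plymouth branch
```

Updating the supplied record from all three 'loose' matches would overwrite a Plymouth branch with Exeter or Torquay details. The stricter rule returns nothing rather than guess, which lowers the match rate but protects the data.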

How to ensure good results

We suggest you shouldn’t take too much for granted. You should understand, in broad terms, what is being done and why. Any good service provider will talk you through what they are going to do. You should always undertake a test of the data cleansing process on a subset of your data.

  • Find several records you know are no longer accurate.
  • Make up a data set with other records to a suitable test size, say 100 records.
  • Ask for a test match and update.
  • Look at the results and measure the updates.
  • Check accuracy against known changes.
  • Check the number of records with any change.
  • In case of doubt, sample a few changes either by phone or on a website to determine accuracy.
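The measuring steps above can be sketched as a small comparison of the test file before and after cleansing. The data layout (dictionaries keyed by record id) and the function name are assumptions made for illustration.

```python
def summarise_test(before, after, known_changes):
    """Compare records before and after a test cleanse.

    before / after: dict mapping record id -> dict of field values.
    known_changes:  dict mapping record id -> the field values you already
                    know to be correct for the records you planted.
    """
    changed_ids = [rid for rid in before if after.get(rid) != before[rid]]
    fixed_ids = [
        rid for rid, expected in known_changes.items()
        if all(after.get(rid, {}).get(field) == value
               for field, value in expected.items())
    ]
    return {
        "records": len(before),
        "changed": len(changed_ids),
        "known_changes": len(known_changes),
        "known_changes_fixed": len(fixed_ids),
        # Changes you cannot verify from what you already know: check these
        # by phone or on the company's website.
        "to_spot_check": [rid for rid in changed_ids if rid not in known_changes],
    }


before = {1: {"contact": "A. Old"}, 2: {"contact": "B. Same"}, 3: {"contact": "C. Stay"}}
after = {1: {"contact": "A. New"}, 2: {"contact": "B. Same"}, 3: {"contact": "C. Moved"}}
known_changes = {1: {"contact": "A. New"}}

summary = summarise_test(before, after, known_changes)
```

In this toy example two of three records changed, the one planted error was corrected, and record 3 is left for a manual spot check.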

Summary

So, data cleansing isn't voodoo at all. If you need to do this for yourself, follow these steps:

  • Standardise your data - carefully, and where it doesn't do any harm
  • Remove duplication - order your lists to highlight similar records and remove unwanted entries
  • Update data - contact companies, or research the information online and update information
  • Remove inaccurate data - such as companies that have ceased trading or decision makers who have retired
  • ... or contact Corpdata. We can do it for you, save you lots of hassle and provide excellent results.

Details, Details: About Corpdata matching systems

It’s magic ...

... voodoo, very technical. I could tell you, but I’d have to kill you, and you wouldn’t understand anyway. But it’s really good. Trust me!

... and for those who need a few more details

The Corpdata matching system is bespoke and features no ‘off the shelf’ components.

It was developed by our in-house team of software engineers and data scientists, having tested every leading ‘identity resolution’ technique and algorithm. Many were rejected because they were too specialised or suited problems faced in other countries. Phonetic techniques are used, including those refined for matching family names. Where these have proved inadequate or insufficient, bespoke routines have been created to fill the remaining gaps, including using DNA or gene sequencing techniques, which were designed to handle ‘errors’ (genetic mutations), or ‘differences’, in the data.
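To give a flavour of how alignment-style techniques tolerate ‘errors’, the sketch below computes Levenshtein edit distance, a simple relative of the sequence alignment methods used in genetics. It is an illustration only and is not the Corpdata system's own routine, whose details are not published here.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


distance = edit_distance("Smith", "Smyth")
```

"Smith" and "Smyth" are one substitution apart, so a matcher using this measure could treat them as likely variants of the same name while still keeping "Smith" and "Jones" apart.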

The final output has been hand-tuned to return the highest number of genuine matches while returning almost no false-positives. We DO NOT report a match where ambiguity remains, to avoid the possibility of ‘harming’ the data.

(By the way, the same system is used for identifying possible duplicates within and between data sets, sometimes called ‘de-duping’ or ‘merge-purge’, but this is tuned to identify all records where there is a likelihood of duplication).

The core development took over 6 man years and builds upon an additional 3 year Knowledge Transfer Partnership undertaken in conjunction with Plymouth University. Ongoing development and tuning are kept under frequent review.

The pre-processing of the data for matching comprises up to 73 data cleansing and standardisation routines for each record. These include standardising differences, such as between ‘&’ and ‘and’, and normalising telephone numbers to remove internationalisation and formatting, such as brackets. Each supplied record is compared to each ‘reference’ record by up to 37 matching techniques, working from the most absolute to the most ‘fuzzy’.
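The general shape of this approach, standardise first and then work from exact to fuzzy comparison, can be sketched with Python's standard library. This is a toy with one standardisation routine and two matching techniques, not a reproduction of the Corpdata system; the 0.9 similarity threshold is an arbitrary assumption.

```python
import re
from difflib import SequenceMatcher


def standardise_name(name):
    """Lower-case, treat '&' as 'and', drop punctuation, collapse spaces."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(name.split())


def match_cascade(supplied_name, reference_names, threshold=0.9):
    """Try an exact match on standardised names first, then a fuzzy one.

    Returns (reference_name, technique), or None when nothing matches or
    when more than one candidate qualifies (ambiguous).
    """
    key = standardise_name(supplied_name)
    standardised = [(standardise_name(ref), ref) for ref in reference_names]

    exact = [ref for std, ref in standardised if std == key]
    if len(exact) == 1:
        return exact[0], "exact"
    if len(exact) > 1:
        return None  # ambiguous: do not report a match

    fuzzy = [
        ref for std, ref in standardised
        if SequenceMatcher(None, key, std).ratio() >= threshold
    ]
    return (fuzzy[0], "fuzzy") if len(fuzzy) == 1 else None


reference = ["Smith and Sons Ltd", "Jones Bakery"]
exact_result = match_cascade("Smith & Sons Ltd.", reference)
fuzzy_result = match_cascade("Smyth & Sons Ltd", reference)
no_result = match_cascade("Acme Widgets", reference)
```

Trying the strictest technique first means a record is only passed to looser comparisons when nothing more certain is available, and returning nothing in ambiguous cases keeps false positives down.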

We also designed the system with performance in mind. The Corpdata Matching System enables us to accurately match a 1 million record ‘supplied’ data set against our 2.4 million ‘reference’ data set in as little as 107 seconds.

To achieve this feat, the matching system is deployed on a dedicated self-healing cluster of 16 blade servers all connected to a dedicated matching back-end database cluster.

(With GDPR in mind we have introduced direct upload of files for matching and de-duping. This means personal data is not handled by those with no NEED to do so. Using Corpdata, you can be assured that your legal compliance and personal data protection have been considered. You can also demonstrate how you comply.)

Corpdata are happy to help

If you have any questions or feel you need a bit more guidance, please feel free to call us on (01626) 777400. We are always happy to help. You have nothing to lose.
