
Introduction
Today I want to discuss data cleansing, also known as data cleaning. It’s a widely known yet, in my opinion, underused step in both data analysis projects and ETL transformation logic. Because of this, many datasets suffer from poor performance and poor accuracy.
Let’s get into more detail.
Reason 1 – Data Accuracy
Let’s first list some of the issues in raw data that can affect the accuracy of any analysis or model built from it:
- Duplicated data
- Missing values
- Outdated information
- Incorrectly entered data
- Inconsistent formatting
- Etc.
You get the idea.
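As a rough illustration, here is a minimal sketch in plain Python that tackles four of the issues above in a single pass. The field names, the sample rows and the cutoff date are all made up for the example, and dropping a bad row (rather than repairing it) is just one possible policy.

```python
from datetime import date

# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"email": " Alice@Example.com ", "country": "uk", "signup": "2023-01-05"},
    {"email": "alice@example.com", "country": "UK", "signup": "2023-01-05"},
    {"email": "bob@example.com", "country": "", "signup": "2023-02-10"},
    {"email": "carol@example.com", "country": "FR", "signup": "2019-07-01"},
]


def clean_records(records, cutoff):
    """Return rows that are normalised, complete, recent and de-duplicated."""
    seen = set()
    cleaned = []
    for row in records:
        # Inconsistent formatting: trim whitespace and normalise case.
        email = row["email"].strip().lower()
        country = row["country"].strip().upper()
        # Missing values: skip rows lacking a required field.
        if not email or not country:
            continue
        # Outdated information: skip rows older than the cutoff date.
        signup = date.fromisoformat(row["signup"])
        if signup < cutoff:
            continue
        # Duplicated data: keep only the first row seen for each email.
        if email in seen:
            continue
        seen.add(email)
        cleaned.append({"email": email, "country": country, "signup": signup})
    return cleaned


result = clean_records(RAW, cutoff=date(2022, 1, 1))
print(result)
```

Note that the two Alice rows only count as duplicates once the formatting has been normalised, which is why that step comes first.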
These issues can skew a dataset massively. If you base your company’s decisions on one figure, and that figure turns out to be badly wrong, you and your company could end up on completely the wrong path.
Which leads me into my next point…
Reason 2 – Building Trust and Confidence
Imagine, in the above example, that you had influenced business decisions with your data model. You had suggested actions based on data that was full of incorrect fields and duplicate entries, and the whole company later found out.
A couple of months afterwards, you suggest another critical decision for the business to take, based on some other findings of yours.
Do you think those decision makers would trust your work after last time?
Trust is difficult to maintain as a data team. Executives and decision makers are notoriously blind to the benefits of data in their company at the best of times. Giving them a reason to have even less confidence in you is the wrong path.
Make yourself trustworthy, clean your data, and reap the benefits.
Reason 3 – Spend Time Cleansing, To Spend Less Time Cleansing
Hear me out, because this may sound backwards at first.
You don’t actually enjoy cleansing your data, right?
Nobody does. Not really. They want to be getting on with the fun stuff. So how do we get to spend less time doing it?
By doing it as our first step and doing a thorough job.
This way, we don’t have to keep coming back to this step time and time again whenever we find data that looks wrong. We’ve already sorted it. Time to kick back and enjoy the more interesting parts of our work.
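One way to make "first step, thorough job" concrete is to put a check at the front of the pipeline that refuses to pass bad data along. The sketch below is only an illustration in plain Python: `find_problems`, `run_pipeline` and the required fields are hypothetical names chosen for the example, not part of any particular tool.

```python
def find_problems(rows, required=("email", "country")):
    """Return human-readable descriptions of rows that still look wrong."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        # Flag any required field that is absent or blank.
        for field in required:
            if not str(row.get(field, "")).strip():
                problems.append(f"row {index}: missing {field}")
        # Flag repeated emails, ignoring case and surrounding whitespace.
        key = str(row.get("email", "")).strip().lower()
        if key and key in seen:
            problems.append(f"row {index}: duplicate email {key}")
        seen.add(key)
    return problems


def run_pipeline(rows, analyse):
    """Run the checks first and only hand clean data to the analysis step."""
    problems = find_problems(rows)
    if problems:
        raise ValueError("; ".join(problems))
    return analyse(rows)


rows = [
    {"email": "alice@example.com", "country": "UK"},
    {"email": "bob@example.com", "country": "FR"},
]
count = run_pipeline(rows, analyse=len)
print(count)
```

If the checks fail, the analysis never runs, so problems surface once at the start instead of halfway through the interesting work.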
Summary
My summary today is simple. Data cleansing is a critical part of any work in data, whether that be analysis, engineering, modelling or anything else you can think of.
It may just be the most important step.
Would you agree?