3 Reasons Data Cleansing is Critical in Data Analysis


Introduction

Today I want to discuss data cleansing, or data cleaning as it is also known. It’s a widely known yet, in my opinion, underused step, both in data analysis projects and in ETL transformation logic. Because it gets skipped, many datasets end up less accurate than they should be and slower to work with.

Let’s get into more detail.

Reason 1 – Data Accuracy

Let’s first list some of the issues in raw data that can affect the accuracy of any analysis or model built from it:

  • Duplicated data
  • Missing values
  • Outdated information
  • Incorrectly entered data
  • Inconsistent formatting
  • Etc.

You get the idea.
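To make a few of the issues listed above concrete, here is a minimal sketch in plain Python that counts duplicates, missing values and inconsistently formatted names. The `raw_orders` records, their field names and the `profile` helper are all made up for illustration; a real dataset would need checks tailored to its own columns.

```python
from collections import Counter

# Hypothetical raw records: the field names and values are invented for illustration.
raw_orders = [
    {"order_id": 1, "customer": "Acme Ltd", "amount": 100.0},
    {"order_id": 2, "customer": "acme ltd ", "amount": 250.0},
    {"order_id": 2, "customer": "acme ltd ", "amount": 250.0},  # duplicated entry
    {"order_id": 3, "customer": "Birch & Co", "amount": None},  # missing value
]


def profile(records):
    """Count a few common data-quality issues in a list of dict records."""
    # Duplicates: every repeat of an order_id beyond its first appearance.
    id_counts = Counter(r["order_id"] for r in records)
    duplicate_rows = sum(n - 1 for n in id_counts.values())

    # Missing values: rows with no amount recorded.
    missing_amounts = sum(1 for r in records if r["amount"] is None)

    # Inconsistent formatting: spellings that collapse into one name once
    # surrounding whitespace and letter case are ignored.
    raw_names = {r["customer"] for r in records}
    normalised_names = {r["customer"].strip().casefold() for r in records}
    inconsistent_names = len(raw_names) - len(normalised_names)

    return {
        "duplicate_rows": duplicate_rows,
        "missing_amounts": missing_amounts,
        "inconsistent_names": inconsistent_names,
    }


report = profile(raw_orders)
print(report)
```

Each of the three counts comes out as 1 for this tiny sample. Outdated or incorrectly entered values are harder to detect automatically and usually need business rules or a trusted reference to compare against.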

These issues can skew a dataset massively. If you’re basing your company’s decisions on a single figure and that figure turns out to be badly wrong, you and your company could be heading down completely the wrong path.

Which leads me into my next point…

Reason 2 – Building Trust and Confidence

Imagine, in the above example, that you’d influenced business decisions with your data model. You’d suggested actions based on data that was full of incorrect fields and duplicate entries, and the whole company later found out.

A couple of months afterwards, you suggest another critical decision for the business to take, based on some new findings of yours.

Do you think those decision makers would trust your work after last time?

Trust is difficult to maintain as a Data Team. Executives and decision makers are notoriously blind to the benefits of data in their company at the best of times. Giving them a reason to have even less confidence in you is the wrong path.

Make yourself trustworthy, clean your data, and reap the benefits.

Reason 3 – Spend Time Cleansing, To Spend Less Time Cleansing

Hear me out, because this may sound backwards at first.

You don’t actually enjoy cleansing your data, right?

Nobody does. Not really. We all want to be getting on with the fun stuff. So how do we get to spend less time doing it?

By doing it as our first step and doing a thorough job.

This way, we don’t have to keep coming back to this step time and time again whenever we find data that looks wrong. We’ve already sorted it. Time to kick back and enjoy the more interesting parts of our work.
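One way to picture "do it first, do it once" is a single cleaning function that runs at the very start, with everything downstream reading only its output. The sketch below is just that, a sketch: `clean_orders`, the sample records, and the choice to drop rows with no amount are my assumptions for illustration, not a rule for every dataset.

```python
def clean_orders(records):
    """Return de-duplicated records with tidy customer names.

    Assumptions for this sketch: order_id identifies a record, rows with
    no amount are dropped, and title case is acceptable for customer names.
    """
    seen_ids = set()
    cleaned = []
    for record in records:
        # Skip rows with a missing amount and any repeat of an order_id.
        if record["amount"] is None or record["order_id"] in seen_ids:
            continue
        seen_ids.add(record["order_id"])
        # Collapse stray whitespace and standardise the letter case.
        tidy_name = " ".join(record["customer"].split()).title()
        cleaned.append({**record, "customer": tidy_name})
    return cleaned


# Hypothetical raw records, invented for illustration.
raw = [
    {"order_id": 1, "customer": "Acme Ltd", "amount": 100.0},
    {"order_id": 2, "customer": "acme ltd ", "amount": 250.0},
    {"order_id": 2, "customer": "acme ltd ", "amount": 250.0},
    {"order_id": 3, "customer": "Birch & Co", "amount": None},
]

# Summing the raw data counts the duplicated order twice.
naive_total = sum(r["amount"] for r in raw if r["amount"] is not None)

# Clean once, up front; every later step works from `orders`.
orders = clean_orders(raw)
clean_total = sum(o["amount"] for o in orders)

print(naive_total, clean_total)
```

Here the raw total is 600.0 while the cleaned total is 350.0, which is exactly the kind of gap that would otherwise send you back to the cleansing step halfway through the analysis.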

Summary

My summary today is simple. Data cleansing is a critical part of any work in data, whether that be analysis, engineering, modelling or anything else you can think of.

It may just be the most important step.

Would you agree?
