The Data Mining Trilogy II: Cleaning

A practical guide to cleaning data: missing values, duplicates, outliers, inconsistent categories, schema checks, and reproducible cleaning pipelines.

2019.09.01 · 2 min read · by Zhenlin Wang

Introduction

Data cleaning turns raw data into data that is consistent enough to analyze or model. The goal is not to make data perfect. The goal is to make data trustworthy for the task.

Cleaning should be reproducible. If a notebook cell fixes a problem by hand, move that logic into a pipeline before the project becomes important.
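One way to make a hand-applied fix reproducible is to turn it into a small named function and run all such functions in a fixed order. The sketch below assumes rows are plain dictionaries; the step names (`strip_whitespace`, `empty_strings_to_none`) and the `run_pipeline` helper are illustrative, not part of any particular library.

```python
from typing import Callable, Iterable

Row = dict
Step = Callable[[list[Row]], list[Row]]


def strip_whitespace(rows: list[Row]) -> list[Row]:
    """Trim leading and trailing whitespace from every string value."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def empty_strings_to_none(rows: list[Row]) -> list[Row]:
    """Encode empty strings as None so missing values have one representation."""
    return [{k: (None if v == "" else v) for k, v in row.items()} for row in rows]


def run_pipeline(rows: list[Row], steps: Iterable[Step]) -> list[Row]:
    """Apply each cleaning step in order and return the cleaned rows."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical raw input for illustration.
raw = [{"name": "  Ada ", "city": ""}, {"name": "Grace", "city": " London"}]
cleaned = run_pipeline(raw, [strip_whitespace, empty_strings_to_none])
print(cleaned)
```

Each step returns new rows instead of editing the input in place, so the same raw data can be re-run through the pipeline and give the same result.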

Missing Values

First ask why the value is missing.

Possible causes:

Treatment options:

Do not impute automatically without understanding the cause.
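Before choosing a treatment, it can help to measure where values are missing. The sketch below computes a missing rate per field and per group, which may hint at a cause (for example, one data source never recording a field). The set of missing markers and the field names `source` and `age` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical markers that this example treats as "missing".
MISSING_MARKERS = {None, "", "NA", "N/A", "null"}


def missing_rate(rows, fields):
    """Return the fraction of rows in which each field is missing."""
    counts = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) in MISSING_MARKERS:
                counts[field] += 1
    total = len(rows)
    return {field: (counts[field] / total if total else 0.0) for field in fields}


def missing_rate_by(rows, field, group_field):
    """Return the missing rate of `field` within each value of `group_field`."""
    totals, missing = Counter(), Counter()
    for row in rows:
        group = row.get(group_field)
        totals[group] += 1
        if row.get(field) in MISSING_MARKERS:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical rows for illustration.
rows = [
    {"source": "web", "age": 34},
    {"source": "web", "age": 29},
    {"source": "app", "age": None},
    {"source": "app", "age": "NA"},
]
overall = missing_rate(rows, ["age"])
by_source = missing_rate_by(rows, "age", "source")
print(overall, by_source)
```

Here every missing `age` comes from one source. That points to a collection problem, not random gaps, and would change which treatment is appropriate.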

Duplicates

Duplicates can come from retries, joins, scraping, batch replays, or logging bugs.

Check:

Deduplication rules should be explicit. Keep the most recent record, the first record, or an aggregated record only when the rule matches the task.
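As a sketch of an explicit rule, the function below keeps the most recent record per key. The key and timestamp field names are hypothetical, and the example assumes timestamps are ISO 8601 strings in a single format, so that string comparison matches chronological order.

```python
def keep_most_recent(rows, key_fields, timestamp_field):
    """Keep one row per key: the one with the largest timestamp.

    Ties keep the row seen first. Output order follows the first
    appearance of each key in the input.
    """
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        current = latest.get(key)
        if current is None or row[timestamp_field] > current[timestamp_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical event log with a replayed order.
events = [
    {"order_id": 1, "status": "pending", "updated_at": "2019-08-01T10:00:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2019-08-02T09:30:00"},
    {"order_id": 2, "status": "pending", "updated_at": "2019-08-01T11:00:00"},
]
deduped = keep_most_recent(events, ["order_id"], "updated_at")
print(deduped)
```

Passing the key fields and the timestamp field as arguments keeps the rule visible at the call site, where a reviewer can check it against the task.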

Outliers

Outliers can be errors or real rare events.

Examples:

Use domain rules, plots, and source checks before deleting outliers. For modeling, consider robust scaling, clipping, winsorization, or models less sensitive to extreme values.
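A minimal sketch of flagging and clipping, assuming a common interquartile-range rule with a multiplier of 1.5. The multiplier and the sample values are illustrative choices, and a flag is a prompt to check the source, not a verdict that the value is wrong.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return a list of booleans marking values outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v < low or v > high for v in values]


def clip(values, low, high):
    """Limit each value to the range [low, high] without dropping any rows."""
    return [min(max(v, low), high) for v in values]


# Hypothetical transaction amounts with one extreme value.
amounts = [12, 15, 14, 13, 16, 15, 14, 500]
flags = flag_outliers(amounts)
low, high = iqr_bounds(amounts)
clipped = clip(amounts, low, high)
print(flags)
print(clipped)
```

Flagging keeps the original value available for inspection. Clipping is a modeling choice applied afterwards, and only if the extreme value should not dominate the fit.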

Inconsistent Categories

Categorical values often drift:

Normalize categories with explicit maps. For production systems, define behavior for unknown categories.
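The sketch below shows one possible explicit map with a defined fallback for unknown values. The country labels, the canonical codes, and the `UNKNOWN` sentinel are assumptions for illustration. A production system might instead reject or log unknown values.

```python
# Hypothetical mapping from cleaned spellings to canonical labels.
CATEGORY_MAP = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "uk": "GB",
    "united kingdom": "GB",
}
UNKNOWN = "UNKNOWN"


def normalize_category(value, mapping=CATEGORY_MAP, unknown=UNKNOWN):
    """Map a raw label to its canonical form, or to `unknown` if unmapped."""
    if value is None:
        return unknown
    # Lowercase, drop periods, and collapse repeated whitespace before lookup.
    key = " ".join(str(value).strip().lower().replace(".", "").split())
    return mapping.get(key, unknown)


raw = ["USA", " u.s.a. ", "United  States", "UK", "Atlantis", None]
normalized = [normalize_category(v) for v in raw]
print(normalized)
```

Because the fallback is a named sentinel, unmapped values stay visible and can be counted, then added to the map once they have been reviewed.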

Schema Validation

Validate the data shape:

Schema checks should run every time the pipeline runs.
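A small sketch of a schema check that could run at the start of each pipeline run. The schema format (field name mapped to an expected type and a required flag) and the field names are assumptions for this example. A real project might use a dedicated validation library instead.

```python
# Hypothetical schema: field name -> (expected type, required?)
SCHEMA = {
    "user_id": (int, True),
    "email": (str, True),
    "age": (int, False),
}


def validate_row(row, schema=SCHEMA):
    """Return a list of human-readable schema errors for one row."""
    errors = []
    for field, (expected_type, required) in schema.items():
        value = row.get(field)
        if value is None:
            if required:
                errors.append(f"{field}: missing required value")
            continue
        # bool is a subclass of int, so reject it unless bool is expected.
        is_stray_bool = isinstance(value, bool) and expected_type is not bool
        if is_stray_bool or not isinstance(value, expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    for field in sorted(set(row) - set(schema)):
        errors.append(f"{field}: unexpected field")
    return errors


def validate_rows(rows, schema=SCHEMA):
    """Return {row index: errors} for every row that fails validation."""
    report = {}
    for index, row in enumerate(rows):
        errors = validate_row(row, schema)
        if errors:
            report[index] = errors
    return report


# Hypothetical rows for illustration.
rows = [
    {"user_id": 1, "email": "a@example.com", "age": 30},
    {"user_id": "2", "email": None},
    {"user_id": 3, "email": "c@example.com", "signup": "2019-09-01"},
]
report = validate_rows(rows)
print(report)
```

Returning a report instead of raising on the first error shows every problem in a batch at once. The pipeline can then decide whether to stop or to quarantine the bad rows.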

Cleaning Without Leakage

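When a cleaning transform learns its parameters from the data, fit it on the training data only. Otherwise, statistics from the validation or test set leak into the features and make evaluation results look better than they are.

Examples include imputation values such as a mean or median, scaling parameters, and clipping or winsorization thresholds.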

Then apply the fitted transform to validation, test, and production data.
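The fit-then-apply pattern can be sketched with a tiny median imputer. The class name `MedianImputer` and its `fit` and `transform` methods are written for this example. They mirror a common convention but are not taken from a specific library.

```python
import statistics


class MedianImputer:
    """Learn a median from training values and reuse it everywhere else."""

    def __init__(self):
        self.median_ = None

    def fit(self, values):
        """Compute the median of the non-missing training values."""
        observed = [v for v in values if v is not None]
        if not observed:
            raise ValueError("cannot fit on data with no observed values")
        self.median_ = statistics.median(observed)
        return self

    def transform(self, values):
        """Replace None with the median learned during fit."""
        if self.median_ is None:
            raise RuntimeError("call fit before transform")
        return [self.median_ if v is None else v for v in values]


# Hypothetical splits for illustration.
train = [10, None, 12, 14, None]
test = [None, 1000]

imputer = MedianImputer().fit(train)      # statistics come from train only
train_clean = imputer.transform(train)
test_clean = imputer.transform(test)      # test values never influence the median
print(train_clean, test_clean)
```

The extreme value in `test` does not change the imputed value, because the median was fixed when the imputer was fit on the training split.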

Cleaning Checklist

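Before modeling, confirm that missing values have a known cause and a deliberate treatment, duplicates are resolved by an explicit rule, outliers have been checked against domain rules and sources, categories are normalized with defined handling for unknowns, schema checks run on every pipeline run, and any transform that learns from the data is fit on training data only.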

Clean data is not data with no problems. It is data whose problems are visible and controlled.