
The Data Mining Trilogy I: Preparation

A practical overview of data preparation: defining the problem, collecting data, building schemas, splitting datasets, and preventing leakage.

2019.08.25 · 2 min read · by Zhenlin Wang

Introduction

Data preparation is the work before modeling: define the problem, collect the data, understand its structure, and create datasets that can support valid analysis or machine learning.

Poor preparation creates downstream confusion. Good preparation makes the rest of the project measurable.

Define the Question

Start with the question:

For machine learning, the prediction time (the moment at which the model must produce its output) is critical. Any feature whose value only becomes known after that moment leaks future information into training.
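As a minimal sketch of the prediction-time rule, the snippet below keeps only the events that were already observed when the prediction is made. The event log, the `observed_at` field, and the `events_known_at` helper are all hypothetical names chosen for illustration.

```python
from datetime import datetime

# Hypothetical event log: each record notes when the value became known.
events = [
    {"user_id": 1, "name": "signup", "observed_at": datetime(2019, 8, 1)},
    {"user_id": 1, "name": "purchase", "observed_at": datetime(2019, 8, 10)},
    {"user_id": 1, "name": "refund", "observed_at": datetime(2019, 8, 20)},
]


def events_known_at(events, prediction_time):
    """Keep only events observed strictly before the prediction time."""
    return [e for e in events if e["observed_at"] < prediction_time]


prediction_time = datetime(2019, 8, 15)
usable = events_known_at(events, prediction_time)
usable_names = [e["name"] for e in usable]
print(usable_names)  # ['signup', 'purchase']
```

The refund happens after the prediction time, so a feature built from it would not be available in production and is excluded here.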

Collect the Data

Record:

Do not treat data collection as a one-time download. Data sources change.
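One way to notice that a source has changed is to store a small record with every pull. The sketch below is an assumption about what such a record might contain (source name, pull time, size, content hash); the field names and the `snapshot_record` helper are hypothetical, not a prescribed format.

```python
import hashlib
from datetime import datetime, timezone


def snapshot_record(source_name, raw_bytes):
    """Describe one pull of a data source so later pulls can be compared."""
    return {
        "source": source_name,
        "pulled_at": datetime.now(timezone.utc).isoformat(),
        "n_bytes": len(raw_bytes),
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
    }


# Two pulls of the same (hypothetical) export, taken at different times.
first = snapshot_record("orders_export", b"id,amount\n1,10\n")
second = snapshot_record("orders_export", b"id,amount\n1,10\n2,25\n")

source_changed = first["sha256"] != second["sha256"]
print(source_changed)  # True
```

Comparing hashes only says that something changed, not what changed, but it is usually enough to trigger a closer look before rerunning an analysis.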

Understand Granularity

Granularity defines what one row means.

Examples:

Many bugs come from mixing granularities without noticing. For example, joining user-level features to transaction-level labels repeats each user's values once per transaction, so sums and averages over the joined table over-weight the most active users.
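The duplication problem is easy to reproduce with toy data. In the sketch below, the `users` and `transactions` tables and the `lifetime_value` column are hypothetical; the point is only that the same total changes once the user-level value is copied onto every transaction row.

```python
# Hypothetical user-level table: one row per user.
users = {
    "u1": {"lifetime_value": 100.0},
    "u2": {"lifetime_value": 40.0},
}

# Hypothetical transaction-level table: one row per transaction.
transactions = [
    {"txn_id": "t1", "user_id": "u1"},
    {"txn_id": "t2", "user_id": "u1"},
    {"txn_id": "t3", "user_id": "u1"},
    {"txn_id": "t4", "user_id": "u2"},
]

# Joining at transaction granularity repeats each user's value per transaction.
joined = [{**t, **users[t["user_id"]]} for t in transactions]

true_total = sum(u["lifetime_value"] for u in users.values())
joined_total = sum(row["lifetime_value"] for row in joined)
print(true_total, joined_total)  # 140.0 340.0

# Collapsing back to one row per user recovers the correct total.
per_user = {row["user_id"]: row["lifetime_value"] for row in joined}
deduped_total = sum(per_user.values())
print(deduped_total)  # 140.0
```

A cheap safeguard is to compare row counts and key totals before and after every join, and to state in a comment what one row means in the result.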

Create a Data Dictionary

A useful data dictionary includes:

This is simple work, but it prevents many misunderstandings.
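A data dictionary is more useful when it can be checked against the data. The sketch below assumes a dictionary made of a name, a type, a description, and a nullability flag per column; those fields, the `ColumnSpec` class, and the `check_row` helper are illustrative choices rather than a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: type
    description: str
    nullable: bool = False


# Hypothetical dictionary for a transaction table.
data_dictionary = [
    ColumnSpec("user_id", str, "Unique user identifier"),
    ColumnSpec("amount", float, "Transaction amount in USD"),
    ColumnSpec("coupon", str, "Coupon code, if one was applied", nullable=True),
]


def check_row(row, specs):
    """Return a list of human-readable problems found in one row."""
    problems = []
    for spec in specs:
        value = row.get(spec.name)
        if value is None:
            if not spec.nullable:
                problems.append(f"{spec.name}: missing")
        elif not isinstance(value, spec.dtype):
            problems.append(f"{spec.name}: expected {spec.dtype.__name__}")
    return problems


good = check_row({"user_id": "u1", "amount": 9.5, "coupon": None}, data_dictionary)
bad = check_row({"user_id": 7, "coupon": "SAVE"}, data_dictionary)
print(good)  # []
print(bad)   # ['user_id: expected str', 'amount: missing']
```

Keeping the dictionary in code means the description, the expected type, and the validation live in one place and drift together.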

Split the Dataset

Choose a split that matches reality:

Keep the test set untouched until final evaluation.
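When the model will be used on future data, one split that matches reality is a chronological one: train on the past, test on the most recent rows. The sketch below shows that idea under the assumption that each row carries a sortable time field; `chronological_split` and the `day` field are hypothetical names.

```python
def chronological_split(rows, time_key, test_fraction=0.2):
    """Sort rows by time and hold out the most recent fraction as the test set."""
    ordered = sorted(rows, key=lambda r: r[time_key])
    n_test = max(1, int(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical rows arriving out of order.
rows = [{"day": d, "y": d % 2} for d in [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]]

train, test = chronological_split(rows, "day")
train_days = [r["day"] for r in train]
test_days = [r["day"] for r in test]
print(train_days)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(test_days)   # [9, 10]
```

A random split would scatter late rows into the training set and let the model learn from the period it is supposed to be tested on.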

Prevent Leakage

Watch for:

Leakage makes models look impressive in notebooks and disappointing in production.
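One common form of leakage is fitting preprocessing statistics on the full dataset, so the test set influences how the training data is transformed. The sketch below contrasts that with statistics fitted on the training rows only; the values and the `standardize` helper are made up for illustration.

```python
from statistics import mean, pstdev

# Hypothetical feature values after the split.
train_values = [10.0, 12.0, 14.0, 16.0]
test_values = [100.0, 120.0]

# Leaky: statistics computed on train and test together see the test set.
leaky_mean = mean(train_values + test_values)

# Safer: fit the statistics on the training rows only, then reuse them.
train_mean = mean(train_values)
train_std = pstdev(train_values)


def standardize(values, center, scale):
    """Center and scale values using statistics supplied by the caller."""
    return [(v - center) / scale for v in values]


test_scaled = standardize(test_values, train_mean, train_std)
print(train_mean, round(leaky_mean, 2))  # 13.0 45.33
```

The large gap between the two means shows how much the test rows would have shifted the transformation if they had been included when fitting it.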

Preparation Checklist

Before analysis or modeling:

Data preparation is not glamorous, but it decides whether the analysis can be trusted.