
The Data Mining Trilogy III: Analysis

A practical guide to exploratory data analysis: distributions, relationships, missingness, outliers, leakage checks, segmentation, and communicating findings.

2020.09.03 · 2 min read · by Zhenlin Wang

Introduction

Exploratory data analysis (EDA) is the process of understanding data before making strong claims or training models.

Good EDA answers:

EDA is not a gallery of plots. It is structured curiosity.

Start With Shape and Schema

Check:

These basics often reveal the biggest issues.
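As a rough illustration of a shape-and-schema pass, the sketch below summarises a table held as a list of dictionaries (for example, rows read with `csv.DictReader`). The column names, the set of missing-value markers, and the helper names `infer_type` and `schema_report` are assumptions made for this example, not part of any library.

```python
from collections import Counter

# Assumed markers for "missing"; adjust to whatever the data source uses.
MISSING = {None, "", "NA", "NaN", "null"}


def infer_type(values):
    """Guess a column type from its non-missing values."""
    present = [v for v in values if v not in MISSING]
    if not present:
        return "empty"
    try:
        for v in present:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        return "text"


def schema_report(rows):
    """Summarise row count, duplicate rows, and per-column type and missingness."""
    columns = sorted({key for row in rows for key in row})
    row_counts = Counter(tuple(row.get(c) for c in columns) for row in rows)
    duplicates = sum(n - 1 for n in row_counts.values())
    report = {"n_rows": len(rows), "n_duplicate_rows": duplicates, "columns": {}}
    for c in columns:
        values = [row.get(c) for row in rows]
        n_missing = sum(v in MISSING for v in values)
        report["columns"][c] = {
            "type": infer_type(values),
            "missing_fraction": n_missing / len(rows) if rows else 0.0,
        }
    return report


# Hypothetical sample: a stray "abc" in age, a repeated row, and "NA" cities.
rows = [
    {"id": "1", "age": "34", "city": "Oslo"},
    {"id": "2", "age": "", "city": "Oslo"},
    {"id": "3", "age": "abc", "city": "NA"},
    {"id": "3", "age": "abc", "city": "NA"},
]
report = schema_report(rows)
```

Here `age` is reported as `text` rather than `numeric` because of the stray `"abc"`, which is the kind of issue that is cheaper to find now than after a model has been trained on it.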

Study Distributions

For numeric variables:

For categorical variables:

Look for impossible values, heavy tails, and categories that should be merged.
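One possible way to surface impossible values, heavy tails, and merge candidates is sketched below. It uses Tukey's IQR fences for numeric columns and a minimum-share cutoff for categories. The choice of `k=1.5`, the 5% cutoff, the function names, and the sample data are all assumptions for illustration, and a flagged value is a prompt to look closer, not proof of an error.

```python
import statistics
from collections import Counter


def numeric_summary(values, k=1.5):
    """Quartiles plus values outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "outliers": [v for v in values if v < low or v > high],
    }


def rare_categories(values, min_share=0.05):
    """Categories whose share of rows is below min_share (merge candidates)."""
    counts = Counter(values)
    total = len(values)
    return sorted(c for c, n in counts.items() if n / total < min_share)


# Hypothetical age column with two impossible values.
ages = [23, 25, 27, 29, 31, 33, 35, 37, -1, 240]
summary = numeric_summary(ages)

# Hypothetical city column where "oslo " is a misspelling of "Oslo".
cities = ["Oslo"] * 30 + ["Bergen"] * 9 + ["oslo "]
rare = rare_categories(cities)
```

In this sample the fences flag `-1` and `240`, and the rare-category check surfaces `"oslo "`, which should probably be merged into `"Oslo"` after trimming and case normalisation.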

Analyze Relationships

Useful checks:

Do not overinterpret correlation. EDA suggests hypotheses; it does not prove causality.
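A small sketch of why a correlation coefficient should be read with care: Pearson's r measures linear association only, so a strong but non-linear relationship can score zero. The `pearson` helper and the toy data are written for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y depends on x linearly
quadratic = [x * x for x in xs]    # y depends on x, but not linearly

r_linear = pearson(xs, linear)        # perfect linear association
r_quadratic = pearson(xs, quadratic)  # zero, although y is a function of x
```

The quadratic case is fully determined by `x` yet has zero linear correlation, so a low coefficient is a reason to plot the pair, not a reason to discard it. The reverse also holds: a high coefficient says nothing on its own about which variable, if either, drives the other.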

Check Leakage

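EDA is a good time to catch leakage, where a feature carries information about the target that would not be available at prediction time.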

Warning signs:

If a single feature predicts the target almost perfectly, investigate how and when it was recorded before trusting it.
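One simple screen, sketched here under assumptions, is to score each numeric feature on its own against a binary target and flag any whose single-feature AUC is close to 0 or 1. The 0.95 threshold, the helper names, and the sample columns (`tenure_months`, `refund_amount`) are hypothetical. The pair-counting AUC is quadratic in the number of rows, so it suits a sample and not a full table.

```python
def single_feature_auc(feature, target):
    """AUC of one numeric feature against a binary target, by pair counting."""
    pos = [f for f, t in zip(feature, target) if t == 1]
    neg = [f for f, t in zip(feature, target) if t == 0]
    # Each (positive, negative) pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leakage_suspects(features, target, threshold=0.95):
    """Names of features whose solo AUC is suspiciously far from 0.5."""
    suspects = []
    for name, values in features.items():
        auc = single_feature_auc(values, target)
        if max(auc, 1 - auc) >= threshold:
            suspects.append(name)
    return suspects


# Hypothetical churn data: refund_amount is only filled in after a customer churns.
churned = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "tenure_months": [24, 3, 18, 7, 5, 20, 2, 12],
    "refund_amount": [0, 0, 0, 0, 40, 15, 60, 25],
}
suspects = leakage_suspects(features, churned)
```

A flagged feature is not automatically leakage; it is a prompt to check whether the value would have been known at the moment the prediction is made.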

Segment the Data

Aggregate plots can hide important differences between subgroups: an overall average can look healthy while one segment behaves very differently.

Segment by:

These slices often become validation slices later.
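A minimal sketch of slicing a metric by segment, assuming rows are dictionaries with a categorical key and a 0/1 outcome. The `device` and `converted` columns and the `rate_by_segment` helper are hypothetical names chosen for this example.

```python
from collections import defaultdict


def rate_by_segment(rows, segment_key, outcome_key):
    """Mean of a 0/1 outcome within each value of segment_key."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [outcome sum, row count]
    for row in rows:
        bucket = totals[row[segment_key]]
        bucket[0] += row[outcome_key]
        bucket[1] += 1
    return {segment: s / n for segment, (s, n) in totals.items()}


# Hypothetical conversion data: 100 desktop rows and 100 mobile rows.
rows = (
    [{"device": "desktop", "converted": 1}] * 30
    + [{"device": "desktop", "converted": 0}] * 70
    + [{"device": "mobile", "converted": 1}] * 2
    + [{"device": "mobile", "converted": 0}] * 98
)

overall = sum(r["converted"] for r in rows) / len(rows)
rates = rate_by_segment(rows, "device", "converted")
```

The overall conversion rate here is 16%, which describes neither group: desktop converts at 30% and mobile at 2%. Keeping the segment key around makes it easy to report model metrics on the same slices later.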

Communicate Findings

A useful EDA report should include:

Keep the report short enough for a teammate to act on it.

Closing

EDA is where you build judgment about the dataset. The output should not just be plots; it should be decisions about cleaning, validation, feature engineering, and modeling.