
A Fundamental Course for Data Engineering

A practical introduction to data engineering fundamentals: ingestion, storage, batch and streaming processing, orchestration, data quality, governance, and serving.

2020.05.01 · 2 min read · by Zhenlin Wang

Introduction

Data engineering builds the systems that move data from source systems to useful destinations: analytics tables, machine learning datasets, dashboards, search indexes, and product features.

A good data pipeline is reliable, observable, versioned, and understandable. It should not require heroic debugging every time a source changes.

Core Responsibilities

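Data engineering usually includes ingestion, storage, batch and streaming processing, orchestration, data quality, and governance. Each is covered below.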

Ingestion

Ingestion moves data into the platform.

Patterns:

Important questions:

Storage

Common storage layers include warehouses, data lakes, lakehouses, databases, object storage, and search systems. The right choice depends on query patterns, cost, latency, governance, and scale.

Batch Processing

Batch processing handles data in scheduled jobs. It is useful for:

Batch jobs should be repeatable. If a job is rerun for a date, the output should be predictable.
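One common way to get that predictability is to have each run overwrite the output partition for its date rather than append to it. The following is a minimal sketch under stated assumptions: `run_daily_job`, the event fields, and the in-memory `warehouse` dict are all hypothetical stand-ins for a real job and a real partitioned table.

```python
from collections import defaultdict
from datetime import date


def run_daily_job(events, run_date, output):
    """Recompute per-user totals for one date and overwrite that partition."""
    totals = defaultdict(int)
    for event in events:
        if event["date"] == run_date:
            totals[event["user"]] += event["amount"]
    # Overwrite, never append: a rerun replaces the partition
    # instead of adding duplicate rows to it.
    output[run_date] = dict(sorted(totals.items()))
    return output[run_date]


# Hypothetical input events and an in-memory stand-in for a partitioned table.
events = [
    {"date": date(2020, 5, 1), "user": "a", "amount": 10},
    {"date": date(2020, 5, 1), "user": "b", "amount": 5},
    {"date": date(2020, 5, 1), "user": "a", "amount": 7},
    {"date": date(2020, 5, 2), "user": "a", "amount": 1},
]
warehouse = {}

first = run_daily_job(events, date(2020, 5, 1), warehouse)
second = run_daily_job(events, date(2020, 5, 1), warehouse)  # rerun, same date
print(first == second)  # True
```

Because the job derives its output only from the input and the run date, a backfill or retry leaves the table in the same state as a single clean run.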

Streaming Processing

Streaming handles events as they arrive.

Use it for:

Streaming systems need careful handling of ordering, duplicates, late events, and state.
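To make those four concerns concrete, here is a small sketch of a windowed counter. It is illustrative only: the class name, the event fields (`id`, `ts`), the 60-second tumbling window, and the 30-second lateness allowance are assumptions, and real stream processors provide these mechanisms with durable, expiring state.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # assumed tumbling window size
ALLOWED_LATENESS = 30    # assumed slack before a window is considered closed


class WindowedCounter:
    """Counts events per event-time window, tolerating disorder and duplicates."""

    def __init__(self):
        self.seen_ids = set()           # state for deduplication (unbounded here)
        self.counts = defaultdict(int)  # window start -> event count
        self.max_event_time = 0         # highest event time observed so far
        self.dropped_late = 0

    def process(self, event):
        if event["id"] in self.seen_ids:
            return  # duplicate delivery: ignore
        self.seen_ids.add(event["id"])

        # The watermark trails the newest event time by the allowed lateness.
        self.max_event_time = max(self.max_event_time, event["ts"])
        watermark = self.max_event_time - ALLOWED_LATENESS

        # Assign by event time, not arrival order, so disorder is harmless.
        window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            self.dropped_late += 1  # window already closed: too late
            return
        self.counts[window_start] += 1


counter = WindowedCounter()
stream = [
    {"id": 1, "ts": 5},
    {"id": 2, "ts": 50},
    {"id": 2, "ts": 50},   # duplicate
    {"id": 3, "ts": 70},
    {"id": 4, "ts": 40},   # out of order, but still within the allowance
    {"id": 5, "ts": 130},
    {"id": 6, "ts": 10},   # arrives after its window has closed
]
for event in stream:
    counter.process(event)

print(dict(counter.counts), counter.dropped_late)
```

In this sketch late events are counted and dropped; a production pipeline might instead route them to a side output or a correction job.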

Orchestration

Orchestration tools schedule and connect pipeline steps.

They should answer:

Airflow, Dagster, Prefect, and cloud-native orchestrators all solve variants of this problem.
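The core idea these tools share can be sketched with the standard library alone: run steps in dependency order, record each outcome, and skip anything downstream of a failure. This is not how any of the tools above are implemented; `run_pipeline` and the task names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order.

    tasks: task name -> zero-argument callable
    dependencies: task name -> set of upstream task names
    Returns task name -> "success", "failed", or "skipped".
    """
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(status[u] != "success" for u in upstream):
            status[name] = "skipped"  # an upstream step did not succeed
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def failing_validation():
    raise ValueError("row count dropped to zero")


# Hypothetical four-step pipeline in which validation fails.
tasks = {
    "extract": lambda: None,
    "transform": lambda: None,
    "validate": failing_validation,
    "load": lambda: None,
}
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

result = run_pipeline(tasks, dependencies)
print(result)
```

Real orchestrators add scheduling, retries, logging, and persistence of this status map, but the dependency graph is the part everything else hangs on.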

Data Quality

Validate data at multiple points:

Bad data should fail loudly when it threatens downstream users.
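Failing loudly can be as simple as raising an exception that stops the pipeline before bad rows are published. The sketch below assumes a hypothetical `validate_rows` helper and a null-fraction check; the column names and threshold are illustrative, not a recommendation.

```python
class DataQualityError(Exception):
    """Raised when a batch fails validation and must not be published."""


def validate_rows(rows, required, max_null_fraction=0.0):
    """Return rows unchanged if they pass; raise DataQualityError otherwise."""
    if not rows:
        raise DataQualityError("no rows received")
    problems = []
    for column in required:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls / len(rows) > max_null_fraction:
            problems.append(f"{column}: {nulls}/{len(rows)} null")
    if problems:
        # Report every failing column at once so one run shows the full picture.
        raise DataQualityError("; ".join(problems))
    return rows


good = [{"id": 1, "amount": 10}, {"id": 2, "amount": 3}]
bad = [{"id": 1, "amount": None}, {"id": None, "amount": 3}]

validated = validate_rows(good, ["id", "amount"])

try:
    validate_rows(bad, ["id", "amount"])
    message = None
except DataQualityError as error:
    message = str(error)

print(message)  # id: 1/2 null; amount: 1/2 null
```

The same check can run at more than one point in a pipeline, for example on raw input and again on the transformed output.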

Governance and Lineage

As systems grow, teams need to know:

Governance is not paperwork when the data powers decisions or models. It is operational safety.
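Lineage makes one such safety question answerable in code: if a dataset changes or breaks, what depends on it? The sketch below assumes lineage is stored as a simple mapping from each dataset to its direct consumers; the dataset names and the `downstream` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage: dataset -> datasets that read directly from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["analytics.daily_revenue", "ml.churn_features"],
    "analytics.daily_revenue": ["dashboard.revenue"],
}


def downstream(dataset, lineage):
    """Return every dataset affected by a change to `dataset`, nearest first."""
    affected = []
    visited = {dataset}
    queue = deque([dataset])
    while queue:  # breadth-first walk over the lineage graph
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in visited:
                visited.add(child)
                affected.append(child)
                queue.append(child)
    return affected


impact = downstream("raw.orders", LINEAGE)
print(impact)
```

The same traversal run in the opposite direction (consumer to source) answers where a given number on a dashboard came from.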

Closing

Data engineering is the foundation for analytics and machine learning. The best pipelines are not the most complex ones; they are the ones that can be rerun, trusted, monitored, and explained.