Blogs · Spark · Data Engineering

Apache Spark: Only the Simple Answer

A simple explanation of Apache Spark: distributed data processing, DataFrames, lazy evaluation, transformations, actions, partitioning, and common performance mistakes.

2020.04.13 · 2 min read · by Zhenlin Wang

Introduction

Apache Spark is a distributed processing engine for large-scale data workloads. It lets you process data across many machines using APIs in Python, SQL, Scala, Java, and R.

Use Spark when the data or computation is too large for one machine, or when the pipeline already lives in a distributed data platform.

The Basic Model

Spark splits work across a cluster.

Key pieces:

Spark is lazily evaluated. Transformations do not run immediately; Spark builds a plan and executes it when an action is called.

DataFrames

Spark DataFrames are the main API for structured data.

df = spark.read.parquet("s3://bucket/events/")

result = (
    df.filter(df.event_type == "purchase")
      .groupBy("country")
      .count()
)

result.show()

The filter and groupBy calls are transformations. The show call is an action.

Transformations and Actions

Common transformations:

Common actions:

Be careful with collect(). It brings data back to the driver and can crash the job if the dataset is large.

Partitioning

Partitioning affects parallelism and shuffle cost.

Too few partitions:

Too many partitions:

Partition by fields commonly used for filtering, such as date, when it matches the workload.

Shuffles

A shuffle moves data across the network. Joins, group-bys, and repartitions often cause shuffles.

Shuffles are expensive because they involve disk, network, and coordination. Many Spark performance problems are really shuffle problems.

Reduce shuffle cost by:

Common Mistakes

When Not to Use Spark

Spark is not always the answer.

Do not use it when:

Spark is powerful, but it is a distributed system. Use it when you need distributed processing.