
Apache Spark: Only the simple answer

Some conceptual understanding of Spark and big data. Honestly, this just scratches the surface.

2020.04.13 · 5 min read · by Zhenlin Wang · updated 2022-04-17

Overview

In this post, I’m just gonna discuss some foundational things I learned about big data with Apache Spark. Personally, I’m just a bit interested in this topic and do not aim to really become a big data professional (not yet~). It does take tremendous effort to learn Spark well, not to mention the entire big data ecosystem. I’ll update this post if I try out some new projects that apply Spark and its APIs in a deeper manner, but for now, let’s just talk about some basics of Spark.

Apache Spark vs Hadoop MapReduce

|                   | Apache Spark                                              | Hadoop MapReduce                                  |
|-------------------|-----------------------------------------------------------|---------------------------------------------------|
| Processing type   | Processes data in batches and in real time                | Processes data in batches only                    |
| Speed             | Up to ~100x faster for in-memory workloads                | Slower, since intermediate data goes through disk |
| Storage           | Stores intermediate data in RAM, i.e. in-memory (faster to retrieve) | Stores intermediate data in HDFS (longer to retrieve) |
| Memory dependence | Caching (for RDDs) and in-memory data storage             | Disk-dependent                                    |
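The MapReduce model in the table above can be sketched in plain Python (a single-machine toy, not real Hadoop): a map phase emits key–value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group. In Hadoop, the shuffle step is where data hits disk and the network, which is a big part of why Spark's in-memory approach is faster.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key (the disk/network-heavy step in Hadoop)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark is fast", "hadoop is batch", "spark is in memory"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```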

Important Components of Spark Ecosystem

RDD
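An RDD (Resilient Distributed Dataset) is Spark's core abstraction: an immutable, partitioned collection of records. Transformations on an RDD (like `map` and `filter`) are lazy: they only build up a lineage of operations, and nothing actually runs until an action (like `reduce`, `count`, or `collect`) is called. As a rough single-machine analogy (this is plain Python with generators, not real Spark), lazy transformations chain without computing anything until the final "action":

```python
# Transformations build a lazy pipeline; nothing runs until an "action" is called.
data = range(1, 11)                            # like sc.parallelize(range(1, 11))
mapped = (x * x for x in data)                 # like rdd.map(lambda x: x * x)   -- lazy
evens = (x for x in mapped if x % 2 == 0)      # like .filter(lambda x: ...)     -- lazy
result = sum(evens)                            # like an action -- runs the whole pipeline
print(result)  # sum of even squares of 1..10
```

In real Spark, each lazy step also records its parent in the RDD's lineage, which is what lets a lost partition be recomputed (the "resilient" part).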

How Spark runs applications with the help of its architecture

When a Spark application is submitted, the driver program creates a SparkContext, which connects to a cluster manager to acquire executors on the worker nodes. The driver then translates the application into a DAG of stages, splits each stage into tasks, and sends the tasks to the executors. The executors run the tasks and report results back to the driver; once all stages complete, the driver ends execution and releases the executors.

What is a Parquet file and what are its advantages
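Parquet is a columnar file format: instead of storing each record's fields together (row-oriented, like CSV), it stores each column's values together. Its main advantages are that a query touching only a few columns can skip the rest (column pruning), values within a column compress much better because they share a type and distribution, and the schema travels with the data. The row-vs-column idea can be shown in plain Python (a conceptual sketch only — real Parquet files are written via Spark, pyarrow, etc.):

```python
# Row-oriented layout (like CSV): every record stores all of its fields together
rows = [
    {"id": 1, "name": "a", "score": 90},
    {"id": 2, "name": "b", "score": 75},
    {"id": 3, "name": "c", "score": 88},
]

# Column-oriented layout (the idea behind Parquet): one array per column
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query touching only "score" reads one contiguous array
# instead of scanning every field of every row
avg = sum(columns["score"]) / len(columns["score"])
print(columns)
print(avg)
```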

What is shuffling in Spark? When does it occur?
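Shuffling is the redistribution of records across partitions so that all records sharing a key end up in the same partition. It occurs on "wide" transformations — `groupByKey`, `reduceByKey`, `join`, `distinct`, `repartition` — and it is expensive because data is serialized, written to disk, and moved over the network between executors. The routing itself is typically hash partitioning, which can be simulated in plain Python (a toy model, not Spark's actual implementation):

```python
# Each record is routed to a target partition by hashing its key.
# In a real cluster this routing means network transfer between executors.
def partition_for(key, num_partitions):
    return hash(key) % num_partitions

num_partitions = 4
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[partition_for(key, num_partitions)].append((key, value))

# After the shuffle, every partition holds ALL values for its keys,
# so a per-key aggregation (e.g. reduceByKey) can then run locally.
for i, part in enumerate(partitions):
    print(i, part)
```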

Notes on Big Data Learning Journey (for those who truly want a Big Data job, and for my future)

To excel in the Big Data domain, you should master the following skills:

  1. Java & Scala $\implies$ Needed to read the source code of related packages and for API development
  2. Linux $\implies$ Everyone should know shell scripting, bash, and common Linux commands
  3. Hadoop $\implies$ It’s a broad topic, but first of all, being able to read the source code behind any API is a must
  4. Hive $\implies$ Know how to use it, understand how the SQL is compiled into underlying MapReduce/Spark operations, and how to optimize the query process
  5. Spark $\implies$ The core development work (but honestly, most of the time it is still SQL)
  6. Kafka $\implies$ High-volume stream data processing; good to use when you have high concurrency
  7. Flink $\implies$ Sometimes faster than Spark. However, you should not discard Spark; learn based on what you need
  8. HBase $\implies$ Know your database fundamentals and understand how HBase works underneath
  9. Zookeeper $\implies$ Distributed cluster data coordination service; know how to use it, and better yet, understand the basics
  10. YARN $\implies$ Cluster resource management; know how to use it
  11. Sqoop, Flume, Oozie/Azkaban $\implies$ Know how to use them

Different cluster managers

  1. Spark Standalone mode
    • by default, applications submitted to a standalone-mode cluster run in FIFO order, and each application tries to use all available nodes
  2. Apache Mesos
    • an open-source project for managing computer clusters, which can also run Hadoop applications
  3. YARN
    • Apache YARN is the cluster resource manager of Hadoop 2.
  4. Kubernetes
    • an open-source system for automating deployment, scaling and management of containerized applications
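The cluster manager is selected via the `--master` flag of `spark-submit`. A sketch of what that looks like for each of the four options (host names and ports here are placeholders, not real endpoints):

```shell
# Standalone mode: point at the standalone master's host and port
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 my_app.py

# YARN (cluster location is taken from the Hadoop configuration)
spark-submit --master yarn my_app.py

# Kubernetes: point at the Kubernetes API server
spark-submit --master k8s://https://k8s-apiserver:6443 my_app.py

# For comparison, local mode runs everything in one JVM (no cluster manager)
spark-submit --master "local[4]" my_app.py
```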