Apache Hadoop
- 🔧 Framework for distributed processing of large data sets across clusters of computers, using simple programming models
Hadoop Modules
- Hadoop Common/Core – libraries & utilities used by modules
- other modules run on top of Hadoop Common
- Hadoop Distributed File System (HDFS) – data storage
- distributed → scalable & resilient (data replicated across nodes in the cluster)
- very high aggregate bandwidth across the cluster → very fast
- YARN (Yet Another Resource Negotiator)
- managing & scheduling cluster resources
- MapReduce – large-scale data processing
- Parallel processing
- Reliable, fault-tolerant (FT)
- Consists of:
- Map functions: transform the input data into intermediate key-value (KV) pairs
- Reduce functions (user-written code): aggregate the intermediate pairs into the final output (see the word-count sketch below)
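A minimal word-count sketch of the two phases, written as Hadoop Streaming scripts in Python; the file names `mapper.py` / `reducer.py` and the word-count task itself are illustrative assumptions, not part of the notes above.

```python
#!/usr/bin/env python3
# mapper.py (illustrative): emit one tab-separated "word<TAB>1" pair per input word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py (illustrative): Hadoop sorts mapper output by key before the reduce
# phase, so all counts for a given word arrive adjacent and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The framework handles splitting the input across mappers, shuffling/sorting the KV pairs, and feeding them to reducers; such a job is typically submitted via the hadoop-streaming JAR with these scripts passed as `-mapper` and `-reducer`.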
Apache Spark
- 🔧 open-source distributed processing system for big data workloads
- ❗ Can run on top of HDFS (via YARN), replacing MapReduce
- ❗ Faster alternative to MapReduce
- in-memory caching, optimized query execution & directed acyclic graph (DAG)
- Use cases: transform data as it comes in → stream processing, ML, interactive SQL…
- 💡 generally NOT used for OLTP or batch processing jobs.
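A minimal PySpark sketch of the in-memory caching plus interactive SQL idea; the `data/events.csv` path and the `user_id` column are hypothetical, and a working PySpark install is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("events-demo").getOrCreate()

# Read once and cache in memory, so repeated queries skip re-reading from disk.
events = spark.read.csv("data/events.csv", header=True, inferSchema=True).cache()

# Interactive SQL over the cached DataFrame.
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM events
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").show()

spark.stop()
```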
How Spark Works
- Driver: coordinates Spark processes in the cluster via a context object (SparkContext / SparkSession)
- Splits the application into jobs/tasks and schedules them
- Connects to resource negotiators such as YARN
- Executors: run the Spark tasks on the cluster's worker nodes (see the sketch below)
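A sketch of that driver/executor split, assuming a reachable YARN-managed cluster (the executor count and memory settings are illustrative): the driver defines the computation through its context object, and the individual tasks run on executors.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-executor-demo")
    .master("yarn")                            # ask YARN for cluster resources
    .config("spark.executor.instances", "4")   # illustrative sizing
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)
sc = spark.sparkContext  # the context object the driver uses to coordinate work

# The driver only defines this computation; the partitions are processed
# as tasks on the executors.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```

With `.master("yarn")` the driver asks YARN for containers to host the executors, which is how Spark plugs into an existing Hadoop cluster in place of MapReduce.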
Spark Components