Apache Hadoop
- 🔧 Framework for distributed processing of large data sets across clusters of computers, using simple programming models
Hadoop Modules
- Hadoop Common/Core – libraries & utilities used by modules
- other modules run on top of Hadoop Common
- Hadoop Distributed File System (HDFS) – data storage
- distributed → scalable & resilient (data replicated across nodes in the cluster)
- very high aggregate bandwidth across the cluster → very fast
- YARN (Yet Another Resource Negotiator)
- managing & scheduling cluster resources
- MapReduce – large-scale data processing
- Parallel processing
- Reliable, fault-tolerant (FT)
- Consists of:
- Map functions: transform the input data into intermediate key-value (KV) pairs
- Reduce functions (user-written code): aggregate the intermediate pairs into the final output (see the word-count sketch below)
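A minimal word-count sketch of the two phases, written as Hadoop Streaming scripts in Python; the file names `mapper.py` / `reducer.py` and the word-count task itself are illustrative assumptions, not part of the notes above.

```python
#!/usr/bin/env python3
# mapper.py (illustrative): emit one tab-separated "word<TAB>1" pair per input word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py (illustrative): Hadoop sorts mapper output by key before the reduce
# phase, so all counts for a given word arrive adjacent and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The framework handles splitting the input across mappers, shuffling/sorting the KV pairs, and feeding them to reducers; such a job is typically submitted via the hadoop-streaming JAR with these scripts passed as `-mapper` and `-reducer`.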
Apache Spark
- 🔧 open-source distributed processing system for big data workloads
- ❗ Can run on top of HDFS (via YARN), replacing MapReduce
- ❗ Faster alternative to MapReduce
- in-memory caching, optimized query execution & directed acyclic graph (DAG)
- Use cases: transform data as it comes in → stream processing, ML, interactive SQL…
- 💡 generally NOT used for OLTP or batch processing jobs.
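A minimal PySpark sketch of the in-memory caching plus interactive SQL idea; the `data/events.csv` path and the `user_id` column are hypothetical, and a working PySpark install is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("events-demo").getOrCreate()

# Read once and cache in memory, so repeated queries skip re-reading from disk.
events = spark.read.csv("data/events.csv", header=True, inferSchema=True).cache()

# Interactive SQL over the cached DataFrame.
events.createOrReplaceTempView("events")
spark.sql("""
    SELECT user_id, COUNT(*) AS n_events
    FROM events
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").show()

spark.stop()
```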
How Spark Works
- Driver: coordinates Spark processes in the cluster via a context object (SparkContext / SparkSession)
- Splits the application into jobs/tasks and schedules them
- Connects to resource negotiators such as YARN
- Executors: run the Spark tasks on the cluster's worker nodes (see the sketch below)
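A sketch of that driver/executor split, assuming a reachable YARN-managed cluster (the executor count and memory settings are illustrative): the driver defines the computation through its context object, and the individual tasks run on executors.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-executor-demo")
    .master("yarn")                            # ask YARN for cluster resources
    .config("spark.executor.instances", "4")   # illustrative sizing
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)
sc = spark.sparkContext  # the context object the driver uses to coordinate work

# The driver only defines this computation; the partitions are processed
# as tasks on the executors.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```

With `.master("yarn")` the driver asks YARN for containers to host the executors, which is how Spark plugs into an existing Hadoop cluster in place of MapReduce.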
Spark Components