Apache Hadoop
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284871
- 🔧 Framework for distributed processing of large data sets across clusters of computers, using simple programming models
Hadoop Modules
- Hadoop Common/Core – libraries & utilities used by modules
- other modules run on top of Hadoop Common
- Hadoop Distributed File System (HDFS) – data storage
- distributed → scalable & resilient (data replicated across nodes in the cluster)
- very high aggregate bandwidth across cluster → very fast
- YARN – resource negotiator
- managing & scheduling cluster resources
- MapReduce – large-scale data processing
- Parallel processing
- Reliable, fault-tolerant (FT)
- Consists of:
- Mapper functions: map input data to sets of key-value (K-V) pairs
- Reducer functions, which you write in code
- aggregate the intermediate data into its final form
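The Mapper → shuffle/sort → Reducer flow above can be sketched in plain Python (a word-count illustration of the model, not the Hadoop API; `mapper`, `reducer`, and `map_reduce` are names made up for this sketch):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (key, value) pair — here (word, 1) — per token
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce phase: aggregate all values for one key into its final form
    return key, sum(values)

def map_reduce(records):
    # Shuffle/sort phase: sort intermediate pairs so equal keys are adjacent,
    # then feed each key's group of values to the reducer
    pairs = sorted(kv for rec in records for kv in mapper(rec))
    return dict(reducer(k, (v for _, v in grp))
                for k, grp in groupby(pairs, key=itemgetter(0)))

print(map_reduce(["the quick fox", "the lazy dog"]))
# → {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

On a real cluster, Hadoop runs many mapper and reducer tasks in parallel across nodes; the single-process sketch only shows the data flow.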
Apache Spark
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284871
- 🔧 Open-source distributed processing system for big data workloads
- ❗ Can be used on top of HDFS thanks to YARN, replacing MapReduce
- ❗ Faster alternative to MapReduce
- in-memory caching, optimized query execution & directed acyclic graph (DAG)
- Use cases: transforming data as it comes in → real-time (RT) stream processing, ML, interactive SQL…
- 💡 generally NOT used for OLTP or batch processing jobs
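The lazy, DAG-based execution that makes Spark faster than MapReduce can be sketched in plain Python (an illustration of the idea, not the Spark API; the `Dataset` class and its methods are invented for this sketch): transformations only build up a pipeline, and no data moves until an action asks for a result, so the whole chain runs in one pass instead of writing intermediate results to disk between stages.

```python
class Dataset:
    def __init__(self, source):
        self._source = source  # no-argument callable producing an iterable

    # Transformations: lazily extend the DAG — no data is processed yet
    def map(self, fn):
        return Dataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return Dataset(lambda: (x for x in self._source() if pred(x)))

    # Action: triggers evaluation of the whole pipeline in a single pass
    def collect(self):
        return list(self._source())

nums = Dataset(lambda: range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # → [0, 4, 16, 36, 64]
```

Real Spark adds to this picture what the sketch omits: partitioned data across the cluster, in-memory caching of intermediate datasets, and an optimizer that rearranges the DAG before execution.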
How Spark Works
