Apache Hadoop
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284871
- 🔧 Framework for distributed processing of large data sets across clusters of computers, using simple programming models
Hadoop Modules
- Hadoop Common/Core – libraries & utilities used by modules
- other modules run on top of Hadoop Common
- Hadoop Distributed File System (HDFS) – data storage
- distributed → scalable & resilient (data replicated across nodes in the cluster)
- very high aggregate bandwidth across cluster → very fast
- YARN – resource negotiator
- managing & scheduling cluster resources
- MapReduce – large-scale data processing
- Parallel processing
- Reliable, fault-tolerant (FT)
- Consists of:
- Mapper functions: map input data to sets of key-value (K-V) pairs
- Reducer functions, which you write in code
- aggregate the intermediate data into its final form
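The Mapper → shuffle/sort → Reducer flow above can be sketched in plain Python (a word-count illustration of the model, not the Hadoop API; `mapper`, `reducer`, and `map_reduce` are names made up for this sketch):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (key, value) pair — here (word, 1) — per token
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce phase: aggregate all values for one key into its final form
    return key, sum(values)

def map_reduce(records):
    # Shuffle/sort phase: sort intermediate pairs so equal keys are adjacent,
    # then feed each key's group of values to the reducer
    pairs = sorted(kv for rec in records for kv in mapper(rec))
    return dict(reducer(k, (v for _, v in grp))
                for k, grp in groupby(pairs, key=itemgetter(0)))

print(map_reduce(["the quick fox", "the lazy dog"]))
# → {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

On a real cluster, Hadoop runs many mapper and reducer tasks in parallel across nodes; the single-process sketch only shows the data flow.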
Apache Spark
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284871
- 🔧 Open-source distributed processing system for big data workloads
- ❗ Can be used on top of HDFS thanks to YARN, replacing MapReduce
- ❗ Faster alternative to MapReduce
- in-memory caching, optimized query execution & directed acyclic graph (DAG)
- Use cases: transforming data as it comes in → real-time (RT) stream processing, ML, interactive SQL…
- 💡 generally NOT used for OLTP or batch processing jobs
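The lazy, DAG-based execution that makes Spark faster than MapReduce can be sketched in plain Python (an illustration of the idea, not the Spark API; the `Dataset` class and its methods are invented for this sketch): transformations only build up a pipeline, and no data moves until an action asks for a result, so the whole chain runs in one pass instead of writing intermediate results to disk between stages.

```python
class Dataset:
    def __init__(self, source):
        self._source = source  # no-argument callable producing an iterable

    # Transformations: lazily extend the DAG — no data is processed yet
    def map(self, fn):
        return Dataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return Dataset(lambda: (x for x in self._source() if pred(x)))

    # Action: triggers evaluation of the whole pipeline in a single pass
    def collect(self):
        return list(self._source())

nums = Dataset(lambda: range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # → [0, 4, 16, 36, 64]
```

Real Spark adds to this picture what the sketch omits: partitioned data across the cluster, in-memory caching of intermediate datasets, and an optimizer that rearranges the DAG before execution.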
How Spark Works
