Amazon Elastic MapReduce (EMR)

Amazon Elastic Map Reduce (EMR)

🔧 AWS-managed Hadoop framework on EC2 instances
- Includes Spark, HBase, Presto, Flink, Hive…
- ‼️ Much more than just MapReduce or just a Hadoop cluster in AWS!
  - 💡 obsolete/weird naming, MapReduce is quite obsolete…

Master node: manages cluster (single EC2 instance)
- Run jobs in EMR by connecting to master, or by submitting ordered steps via console
Core node(s): Hosts HDFS data & runs tasks
- Can be scaled up & down, but with some risk of data loss
Task node(s): Runs tasks, does not host data
- No risk of data loss when removing
- Good use of spot instances
Diagram

EMR Serverless → AWS scales nodes automatically
Persistence
- Transient clusters: terminated when job is finished
  - spin up task nodes that use spot instances
- Long-running clusters: interact with apps directly
  - e.g. experimenting with datasets, ad-hoc queries…
  - can use reserved instances (save $)
Integration with AWS suite (VPC, S3 for data I/O, CW, AWS Data pipeline…)
Storage options
- HDFS (default) → distribute data across nodes
  - Block size default = 128MB
  - ephemeral → very fast, but will disappear when cluster is terminated
- EMRFS: access S3 as if it were HDFS → can use S3 instead of HDFS
  - still pretty fast, also persistent
  - EMRFS Consistent View – Optional for S3 consistency
  - Uses DDB to track consistency
- Local FS (ephemeral, not distributed)
- EBS for HDFS (backs up HDFS data)
EMR promises
- Possible to resize running cluster's core nodes
- Automatic reprovisioning of core nodes that fail
- Task nodes can be added/removed on the fly
Billing:
- Hourly charge for EMR
- Plus EC2 charges
EMR Notebook: similar to Zeppelin, but more AWS integration
- Hosted in VPC, backups to S3…
- Can provision EMR clusters seamlessly