Amazon Elastic Map Reduce (EMR)
- 🔧 AWS-managed Hadoop framework on EC2 instances
- Includes Spark, HBase, Presto, Flink, Hive…
- ‼️ Much more than just MapReduce or just a Hadoop cluster in AWS!
- 💡 obsolete/weird naming, MapReduce is quite obsolete…
EMR cluster architecture
- Master node: manages cluster (single EC2 instance)
- Run jobs in EMR by connecting to master, or by submitting ordered steps via console
- Core node(s): Hosts HDFS data & runs tasks
- Can be scaled up & down, but with some risk of data loss
- Task node(s): Runs tasks, does not host data
- No risk of data loss when removing
- Good use of spot instances
- Diagram
EMR features
- EMR Serverless → AWS scales nodes automatically
- Persistence
- Transient clusters: terminated when job is finished
- spin up task nodes that use spot instances
- Long-running clusters: interact with apps directly
- e.g. experimenting with datasets, ad-hoc queries…
- can use reserved instances (save $)
- Integration with AWS suite (VPC, S3 for data I/O, CW, AWS Data pipeline…)
- Storage options
- HDFS (default) → distribute data across nodes
- Block size default = 128MB
- ephemeral → very fast, but will disappear when cluster is terminated
- EMRFS: access S3 as if it were HDFS → can use S3 instead of HDFS
- still pretty fast, also persistent
- EMRFS Consistent View – Optional for S3 consistency
- Uses DDB to track consistency
- Local FS (ephemeral, not distributed)
- EBS for HDFS (backs up HDFS data)
- EMR promises
- Possible to resize running cluster's core nodes
- Automatic reprovisioning of core nodes that fail
- Task nodes can be added/removed on the fly
- Billing:
- Hourly charge for EMR
- Plus EC2 charges
- EMR Notebook: similar to Zeppelin, but more AWS integration
- Hosted in VPC, backups to S3…
- Can provision EMR clusters seamlessly
Choosing instance types for EMR
- Master node:
m4.large
for <50 nodes, m4.xlarge
for >50 nodes