Intro to Data Firehose from Cantrill's SAA-C03
Data Firehose
Additional Concepts
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45356757
- Possible to convert data formats (see the delivery-stream sketch after this list):
  - from JSON to Parquet/ORC (only when the destination is S3)
  - from CSV to JSON (via a Lambda transformation function; sketch below)
- Supports compression when the target is S3 (GZIP, ZIP, and Snappy)
  - If the data is subsequently loaded into Redshift, only GZIP is supported
- ❗ Spark & KCL do NOT read from Firehose (they consume from Kinesis Data Streams)!
- Data is never dropped
  - Transformation and delivery failures are written to S3 (typically to a separate error bucket or prefix)
- Optionally, all source records can also be backed up to S3
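
A minimal boto3 sketch tying the points above together: JSON-to-Parquet conversion (S3 destination only), a separate error prefix for failed records, and optional source record backup. The stream name, bucket and role ARNs, and the Glue database/table (Firehose needs a Glue schema to write Parquet) are hypothetical placeholders:

```python
import boto3

firehose = boto3.client("firehose")

# All names/ARNs below are hypothetical, for illustration only
firehose.create_delivery_stream(
    DeliveryStreamName="example-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-bucket",
        "Prefix": "data/",
        # Failed records land under a separate prefix, per the notes above
        "ErrorOutputPrefix": "errors/",
        # JSON -> Parquet conversion (S3 destinations only); requires a Glue schema
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
                "DatabaseName": "example_glue_db",
                "TableName": "example_table",
                "VersionId": "LATEST",
            },
        },
        # With format conversion enabled, object-level CompressionFormat must stay
        # UNCOMPRESSED; compression is handled inside the Parquet serializer instead
        "CompressionFormat": "UNCOMPRESSED",
        # Optional: back up all source records to S3 as well
        "S3BackupMode": "Enabled",
        "S3BackupConfiguration": {
            "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
            "BucketARN": "arn:aws:s3:::example-backup-bucket",
        },
    },
)
```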
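
And a minimal sketch of the Lambda transformation contract for the CSV-to-JSON case; the CSV column layout (`id,name,value`) is an assumption for illustration:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose transformation Lambda: decode each record, convert a CSV
    line to a JSON document, and return it base64-encoded with result 'Ok'."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        # Assumed column layout for illustration: id,name,value
        rec_id, name, value = line.split(",")
        doc = json.dumps({"id": rec_id, "name": name, "value": value}) + "\n"
        output.append({
            "recordId": record["recordId"],  # must echo the incoming recordId
            "result": "Ok",                  # or 'Dropped' / 'ProcessingFailed'
            "data": base64.b64encode(doc.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```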
- Data Firehose Architecture Diagram
Firehose Buffer Sizing
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45356757
- Records are accumulated in a buffer until the buffer is flushed
- ‼️ Recent change: Firehose now supports zero buffering, i.e. buffering can effectively be disabled (buffering interval set to 0 seconds)
- The buffer is flushed as soon as either of two limits is reached (see the BufferingHints sketch after this list):
  - Buffer size (e.g. 32 MB)
    - This is the limit hit first under high throughput
    - Under high throughput, Firehose can automatically increase the buffer size (autoscaling)
  - Buffer time (e.g. 60 s)
    - This is the limit hit first under low throughput
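
A minimal sketch of the corresponding `BufferingHints` block as it would appear inside a boto3 `ExtendedS3DestinationConfiguration`; the values are the examples from above, and mapping the zero-buffering feature to `IntervalInSeconds: 0` is an assumption based on how the setting is exposed:

```python
# Firehose flushes on whichever limit is reached first
buffering_hints = {
    "SizeInMBs": 32,          # size limit: hit first under high throughput
    "IntervalInSeconds": 60,  # time limit: hit first under low throughput
}

# Zero buffering (the recent change above): set the interval to 0 so records
# are delivered without waiting for the buffer to fill
zero_buffering_hints = {"IntervalInSeconds": 0}
```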
Kinesis Data Streams vs Data Firehose