Intro to Data Firehose from Cantrill's SAA-C03
Data Firehose
Additional Concepts
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45356757
- Possible to convert data formats (see the delivery-stream sketch after this list):
  - from JSON to Parquet/ORC (only when the destination is S3)
  - from CSV to JSON (via a Lambda transformation function; sketch below)
- Supports compression when the target is S3 (GZIP, ZIP, and Snappy)
  - If the data is subsequently loaded into Redshift, only GZIP is supported
- ❗ Spark & KCL do NOT read from Firehose (they consume from Kinesis Data Streams)!
- Data is never dropped
  - Transformation and delivery failures are written to S3 (typically to a separate error bucket or prefix)
- Optionally, all source records can also be backed up to S3
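
A minimal boto3 sketch tying the points above together: JSON-to-Parquet conversion (S3 destination only), a separate error prefix for failed records, and optional source record backup. The stream name, bucket and role ARNs, and the Glue database/table (Firehose needs a Glue schema to write Parquet) are hypothetical placeholders:

```python
import boto3

firehose = boto3.client("firehose")

# All names/ARNs below are hypothetical, for illustration only
firehose.create_delivery_stream(
    DeliveryStreamName="example-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-bucket",
        "Prefix": "data/",
        # Failed records land under a separate prefix, per the notes above
        "ErrorOutputPrefix": "errors/",
        # JSON -> Parquet conversion (S3 destinations only); requires a Glue schema
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
                "DatabaseName": "example_glue_db",
                "TableName": "example_table",
                "VersionId": "LATEST",
            },
        },
        # With format conversion enabled, object-level CompressionFormat must stay
        # UNCOMPRESSED; compression is handled inside the Parquet serializer instead
        "CompressionFormat": "UNCOMPRESSED",
        # Optional: back up all source records to S3 as well
        "S3BackupMode": "Enabled",
        "S3BackupConfiguration": {
            "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
            "BucketARN": "arn:aws:s3:::example-backup-bucket",
        },
    },
)
```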
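
And a minimal sketch of the Lambda transformation contract for the CSV-to-JSON case; the CSV column layout (`id,name,value`) is an assumption for illustration:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose transformation Lambda: decode each record, convert a CSV
    line to a JSON document, and return it base64-encoded with result 'Ok'."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        # Assumed column layout for illustration: id,name,value
        rec_id, name, value = line.split(",")
        doc = json.dumps({"id": rec_id, "name": name, "value": value}) + "\n"
        output.append({
            "recordId": record["recordId"],  # must echo the incoming recordId
            "result": "Ok",                  # or 'Dropped' / 'ProcessingFailed'
            "data": base64.b64encode(doc.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```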
- Data Firehose Architecture Diagram
Firehose Buffer Sizing
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45356757
- Records are accumulated in a buffer until the buffer is flushed
- ‼️ Recent change: Firehose now supports zero buffering, i.e. buffering can effectively be disabled (buffering interval set to 0 seconds)
- The buffer is flushed as soon as either of two limits is reached (see the BufferingHints sketch after this list):
  - Buffer size (e.g. 32 MB)
    - This is the limit hit first under high throughput
    - Under high throughput, Firehose can automatically increase the buffer size (autoscaling)
  - Buffer time (e.g. 60 s)
    - This is the limit hit first under low throughput
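
A minimal sketch of the corresponding `BufferingHints` block as it would appear inside a boto3 `ExtendedS3DestinationConfiguration`; the values are the examples from above, and mapping the zero-buffering feature to `IntervalInSeconds: 0` is an assumption based on how the setting is exposed:

```python
# Firehose flushes on whichever limit is reached first
buffering_hints = {
    "SizeInMBs": 32,          # size limit: hit first under high throughput
    "IntervalInSeconds": 60,  # time limit: hit first under low throughput
}

# Zero buffering (the recent change above): set the interval to 0 so records
# are delivered without waiting for the buffer to fill
zero_buffering_hints = {"IntervalInSeconds": 0}
```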
Kinesis Data Streams vs Data Firehose