Ref: https://learn.cantrill.io/courses/1820301/lectures/41301410
- 💡 NOTE: This service used to be called Amazon Kinesis Data Firehose, being a member of the Kinesis family of products. It is however no longer a member of the Kinesis family.
Amazon Data Firehose - Basic Concepts
- 🔧 Fully-managed data delivery service
- Loads/persists data into data lakes, data stores, and analytics services
- 💡 By default, Kinesis Data Streams does NOT offer data persistence
- Data records available while inside rolling window… but once outside they expire
- Firehose can read from a Kinesis Data Stream (KDS) and load its data elsewhere, persisting it beyond the rolling window
- Features
- Fully-managed by AWS:
- Serverless, regionally resilient, public service
- Automatic scaling
- ❗ Unlike Kinesis Data Streams, where shard scaling is manual (in provisioned mode)!
- ‼️ By default, NEAR real-time delivery of data to destination (~60s)!
- ❗ Default is NOT real-time like e.g. Kinesis Data Streams (~200ms)!
- ❗ Nowadays, Firehose buffer can be disabled → CAN do RT delivery!
- Can transform data on the fly with Lambda
- Can add latency, depending on complexity of processing
- Billing: pay-as-you-go
- Billed based on volume of data passed through (quite cost-effective!)
- Use cases
- Load data into supported destinations
- Persist data of a Kinesis Data Stream after it exits rolling window
- Store data in a different format (Lambda can transform it)
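The Lambda transformation mentioned above can be sketched as follows. This is a minimal example, assuming the incoming records are JSON: Firehose hands the function a batch of base64-encoded records, and the function returns one entry per `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The `processed` field is a hypothetical transformation chosen purely for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Sketch of a Firehose transformation: decode, modify, re-encode each record."""
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers each payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: tag the record as processed.
            payload["processed"] = True
            # Newline-delimit the output so records stay separable at the destination.
            encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except (ValueError, TypeError):
            # Payload was not valid base64/JSON (or not a JSON object): mark it failed
            # and hand the original data back unchanged.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` must appear in the response, which is why failed records are returned rather than skipped.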
Amazon Data Firehose - Architecture

- Many supported producers
- AWS services (CWLogs, CWEvents…)
- Internet of Things (IoT) devices
- Kinesis Data Streams
- Kinesis producers (KPL, Kinesis Agent…)
- KPL = Kinesis Producer Library (Kinesis Agent is built on top of it)
- If no need for streaming features from Kinesis, just send directly to Firehose!
- Supported destinations (learn them well, useful for exam!)
- HTTP endpoints → can deliver to 3rd party providers
- Splunk
- Amazon S3
- Amazon Redshift
- Amazon OpenSearch Service
- 💡 Used to be called Amazon Elasticsearch Service
- ❗ Data buffered for delivery
- Default: Firehose waits for 1MB of data or 60s before flushing buffer
- ~60s is considered near real time (real time is <1s)
- ‼️ NOTE: recently, AWS has added the option to disable the buffer → Data Firehose can nowadays deliver in real time too! (although buffer is still the default)
- Support for Lambda Transformation functions
- Can be generated from blueprints (perform common tasks)
- ❗ Can add latency to Firehose delivery (depending on complexity)
- Can optionally store raw/unmodified data in an S3 Backup Bucket
- Direct data delivery to all destinations… except for Redshift!
- ❗ Redshift uses S3 as intermediary → Firehose stores data in an S3 bucket, then Redshift loads data from bucket with Redshift COPY
- Managed & automatic process, but be aware that an S3 bucket is used
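The buffering rule noted earlier (flush once 1MB has accumulated or 60s have passed, whichever comes first) can be illustrated with a toy model. This is only a sketch of the size-or-time idea, not Firehose's actual implementation: the class name `FlushBuffer` and its parameters are hypothetical, and the time check here only runs when a record arrives, whereas a real service would also flush on a timer.

```python
class FlushBuffer:
    """Toy size-or-time buffer (hypothetical; illustrates the flush rule only)."""

    def __init__(self, max_bytes=1_000_000, max_seconds=60.0):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # arrival time of the first record in the current batch

    def add(self, record: bytes, now: float):
        """Buffer a record; return the flushed batch if a threshold is hit, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        size_hit = self.size >= self.max_bytes
        time_hit = now - self.opened_at >= self.max_seconds
        if size_hit or time_hit:
            return self.flush()
        return None

    def flush(self):
        """Return the current batch and reset the buffer."""
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch
```

With low-volume producers the time threshold is what triggers delivery, which is why the notes describe the default behaviour as near real time rather than real time.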