Ref: https://learn.cantrill.io/courses/1820301/lectures/41301410 [ASSOCIATESHARED]
- 💡 NOTE: This service used to be called Amazon Kinesis Data Firehose and was part of the Kinesis family of products. It has since been renamed Amazon Data Firehose and is no longer part of that family.
Data Firehose - Basic Concepts
- 🔧 Fully-managed data delivery service
- Loads/persists data into data lakes, data stores and analytics services
- 💡 By default, Kinesis Data Streams does NOT offer data persistence
- Data records available while inside rolling data window, but once outside they expire and are gone
- Firehose can connect to a Kinesis Data Stream and load it elsewhere, persisting it beyond the rolling window
- Features
- Fully serverless, regionally resilient
- Automatic scaling
- ❗ Unlike Kinesis Data Streams, where shard capacity must be managed manually (in provisioned mode)!
- ‼️ NEAR real-time delivery of data to destination (~60s)!
- ❗ NOT real-time like e.g. Kinesis Data Streams (~200ms)!!!
- Can transform data on the fly with Lambda
- Can add latency, depending on complexity of processing
- Billing: pay-as-you-go
- Billed based on volume of data passed through (quite cost effective)
- Use cases
- Load data into supported destinations
- Persist data of a Kinesis Data Stream after it exits rolling window
- Store data in a different format (Lambda can transform it)
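The "store data in a different format" use case relies on a Lambda transformation function. Below is a minimal sketch of such a handler, assuming each record is a JSON object with a hypothetical `message` field (that field and the upper-casing step are illustrative only). Each incoming record carries a `recordId` and base64-encoded `data`, and each returned record must echo the `recordId`, set a `result` (`Ok`, `Dropped` or `ProcessingFailed`) and return base64-encoded `data`.

```python
import base64
import json


def lambda_handler(event, context):
    """Sketch of a Firehose transformation: upper-case a hypothetical 'message' field."""
    output = []
    for record in event["records"]:
        try:
            # Firehose hands each record's payload over as base64-encoded bytes.
            payload = json.loads(base64.b64decode(record["data"]))
            payload["message"] = str(payload.get("message", "")).upper()
            # Newline-delimit the JSON so records stay separable at the destination.
            encoded = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": encoded}
            )
        except (ValueError, AttributeError, TypeError):
            # Payload was not a JSON object: hand it back unchanged, marked as failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

The more work the handler does per record, the more latency it adds to delivery, so it is worth keeping transformations small.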
Data Firehose - Architecture
- Many supported producers
- AWS services (CWLogs, CWEvents…)
- IoT devices
- Kinesis Data Streams
- Kinesis producers (KPL, Kinesis Agent…)
- If no need for streaming features from Kinesis Data Streams, send directly to Firehose
- ❗ Supported destinations
- HTTP endpoints → can deliver to 3rd party providers
- Splunk
- S3
- Redshift
- ElasticSearch (now Amazon OpenSearch Service)
- 💡 Learn the destinations well, useful for exam!
- ‼️ Data received in real time (~200ms), but stored in near-real time (~60s)!
- ❗ Data buffered for delivery
- Default: Firehose waits for 1MB or 60s of data before delivering it
- ‼️ NOTE: recently, AWS has added the option to disable the buffer → Data Firehose can nowadays deliver in real time too! (although buffer is still the default)
- Support for Lambda Transformation functions
- Can be generated from blueprints (perform common tasks)
- ❗ Can add latency to Firehose delivery (depending on complexity)
- Can optionally store raw/unmodified data in an S3 Backup Bucket
- Direct data delivery to all destinations… except for Redshift!
- ❗ Redshift uses S3 as intermediary → Firehose stores data in an S3 bucket, then Redshift loads data from bucket with Redshift COPY
- Managed & automatic process, but be aware that an S3 bucket is used
- Data Firehose Architecture Diagram
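The "1MB or 60s" buffering rule from the notes can be illustrated with a toy model. This is only a sketch of the size-or-time idea, not how Firehose is implemented: the class `BufferSketch` and its methods are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BufferSketch:
    """Toy size-or-time buffer (hypothetical; not the Firehose implementation)."""

    max_bytes: int = 1024 * 1024  # 1 MB size threshold, as in the notes
    max_seconds: float = 60.0     # 60 s interval threshold, as in the notes
    _records: List[bytes] = field(default_factory=list)
    _size: int = 0
    _opened_at: Optional[float] = None

    def put(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record received at time `now`; return a batch if one is due."""
        if self._opened_at is None:
            self._opened_at = now  # the interval starts with the first record
        self._records.append(record)
        self._size += len(record)
        return self._flush_if_due(now)

    def tick(self, now: float) -> Optional[List[bytes]]:
        """Check the time threshold without adding data."""
        return self._flush_if_due(now)

    def _flush_if_due(self, now: float) -> Optional[List[bytes]]:
        if not self._records:
            return None
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        if too_big or too_old:
            batch = self._records
            self._records, self._size, self._opened_at = [], 0, None
            return batch
        return None
```

A small record sits in the buffer until the interval elapses, while a large burst is flushed as soon as the size threshold is hit. This is why low-volume sources see roughly the full buffer interval as delivery latency.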