AWS Lake Formation - Mini deep-dive
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/46730185
- 🔧 Manage Data Lakes in S3
- Sets up a secure data lake quickly (in days → fast for a data lake!)
- Built on top of AWS Glue → Whatever Glue/Spark can do, Lake Formation can also do
- Architecture Diagram
- Features that facilitate creating and managing data lakes:
- Loading data & monitoring data flows (data crawlers)
- Setting up partitions
- Encryption & managing keys
- Defining transformation jobs & monitoring them
- e.g. data cleaning, imputing missing data…
- Access control (ACLs)
- Auditing
- Sources can be very different and varied (S3, RDBMS, NoSQL…)
- They can be on-premises or cloud
- Data lakes can be read directly by Athena, Redshift (Spectrum) and EMR
- ‼️ Lake Formation does NOT support manifests in Athena or Redshift queries!!
- Pricing
- No cost for Lake Formation itself
- Billed for underlying services (Glue, S3, EMR, Athena, Redshift…)
Process for building a Data Lake
- Create an IAM user for Data Analyst
- Data Analyst will build and use the lake
- Create AWS Glue connection to your data source(s)
- Create S3 bucket for data lake
- Register S3 bucket path in Lake Formation, grant permissions
- Create DB in Lake Formation for data catalog, grant permissions
- Use a blueprint for a workflow (e.g. making DB snapshots)
- Run the workflow
- Grant SELECT permissions to whoever reads the lake (Athena, Redshift Spectrum, etc.)
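Three of the steps above (register the S3 path, create the catalog DB, grant SELECT) can be sketched with boto3. This is a minimal sketch, not the course's own code: the function names, bucket, database, table and principal values are all hypothetical, and it assumes the catalog DB is created through the Glue `create_database` call, since Lake Formation uses the Glue Data Catalog. The IAM user, Glue connection and blueprint/workflow steps are left out. The helpers only build request arguments, so they can be inspected without touching AWS.

```python
from typing import Any, Dict


def register_request(bucket: str, prefix: str = "") -> Dict[str, Any]:
    """Arguments for lakeformation.register_resource (register the S3 path)."""
    arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        arn += "/" + prefix.strip("/")
    # Assumption: the service-linked role is used rather than a custom RoleArn.
    return {"ResourceArn": arn, "UseServiceLinkedRole": True}


def create_database_request(name: str) -> Dict[str, Any]:
    """Arguments for glue.create_database (the data catalog DB)."""
    return {"DatabaseInput": {"Name": name}}


def grant_select_request(principal_arn: str, database: str, table: str) -> Dict[str, Any]:
    """Arguments for lakeformation.grant_permissions (SELECT for a reader)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


def apply_steps(lf_client, glue_client, bucket, prefix, database, table, reader_arn):
    """Run the three calls in the order the notes list them.

    In practice the table only exists after the blueprint workflow has run,
    so the grant would happen after that step.
    """
    lf_client.register_resource(**register_request(bucket, prefix))
    glue_client.create_database(**create_database_request(database))
    lf_client.grant_permissions(**grant_select_request(reader_arn, database, table))


# Usage with real clients (requires AWS credentials; all values are placeholders):
#   import boto3
#   apply_steps(boto3.client("lakeformation"), boto3.client("glue"),
#               "my-lake-bucket", "raw/", "lake_db", "orders",
#               "arn:aws:iam::123456789012:user/data-analyst")
```

Keeping the request builders separate from the client calls makes it easy to check exactly what would be granted before running anything against an account.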
Governed Tables in S3