AWS Lake Formation

AWS Lake Formation - Mini deep-dive

Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/46730185

🔧 Manage Data Lakes in S3
- Lets you set up a secure data lake quickly (in days, which is fast for a data lake!)
- Built on top of AWS Glue → Glue features (crawlers, the Data Catalog, Spark-based ETL jobs) are available through Lake Formation
Features that facilitate creating and managing data lakes:
- Loading data & monitoring data flows (data crawlers)
- Setting up partitions
- Encryption & managing keys
- Defining transformation jobs & monitoring them
  - e.g. data cleaning, imputing missing data…
- Access control (ACLs)
- Auditing
Data sources can vary widely (S3, RDBMS, NoSQL…)
- They can be on-premises or in the cloud
Data lakes can be read directly by Athena, Redshift (Spectrum), and EMR
- ‼️ Manifests in Athena or Redshift queries are NOT supported!!
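A minimal sketch of reading a lake table through Athena with boto3, to make the "read directly" point concrete. The database, table, and results-bucket names are hypothetical placeholders, not from the course. The request builder is plain Python; only `run_query` needs boto3 and AWS credentials.

```python
from typing import Any, Dict


def build_athena_query_request(
    database: str, table: str, output_s3: str, limit: int = 10
) -> Dict[str, Any]:
    """Build keyword arguments for Athena's start_query_execution call."""
    if not output_s3.startswith("s3://"):
        raise ValueError("output_s3 must be an s3:// URI")
    return {
        # Simple preview query against a table in the Glue Data Catalog
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location
        "ResultConfiguration": {"OutputLocation": output_s3},
    }


def run_query(request: Dict[str, Any]) -> str:
    """Submit the query and return its execution ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only
    request = build_athena_query_request(
        "sales_lake_db", "orders", "s3://example-athena-results/"
    )
    print(request["QueryString"])
```

The query only returns rows if the calling principal has been granted SELECT on the table in Lake Formation.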
Pricing
- No cost for Lake Formation itself
- Billed for underlying services (Glue, S3, EMR, Athena, Redshift…)

Steps to build a data lake:
1. Create an IAM user for the Data Analyst
   - The Data Analyst will build and use the lake
2. Create an AWS Glue connection to your data source(s)
3. Create an S3 bucket for the data lake
4. Register the S3 bucket path in Lake Formation and grant permissions
5. Create a database in Lake Formation for the data catalog and grant permissions
6. Use a blueprint for a workflow (e.g. making DB snapshots)
7. Run the workflow
8. Grant SELECT permissions to whoever reads the lake (Athena, Redshift Spectrum, etc.)
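The register, create-database, and grant-SELECT steps above could be scripted roughly as follows with boto3. This is a sketch under assumptions: the bucket ARN, database, table, and analyst ARN are hypothetical, and it uses the service-linked role for registration, which may not match your setup. The request builders are plain Python; only `apply_setup` talks to AWS.

```python
from typing import Any, Dict, List


def register_request(bucket_arn: str) -> Dict[str, Any]:
    """Arguments for lakeformation.register_resource (register the S3 path)."""
    return {"ResourceArn": bucket_arn, "UseServiceLinkedRole": True}


def select_grant_request(principal_arn: str, database: str, table: str) -> Dict[str, Any]:
    """Arguments for lakeformation.grant_permissions (SELECT on one table)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


def apply_setup(bucket_arn: str, database: str, grants: List[Dict[str, Any]]) -> None:
    """Run the calls against AWS (needs boto3, credentials, and admin rights)."""
    import boto3  # imported lazily so the builders work without boto3

    lakeformation = boto3.client("lakeformation")
    glue = boto3.client("glue")

    lakeformation.register_resource(**register_request(bucket_arn))
    # The Lake Formation database lives in the Glue Data Catalog
    glue.create_database(DatabaseInput={"Name": database})
    for grant in grants:
        lakeformation.grant_permissions(**grant)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only
    grant = select_grant_request(
        "arn:aws:iam::123456789012:user/data-analyst", "sales_lake_db", "orders"
    )
    print(grant["Permissions"])
```

In practice the table must already exist (for example, created by the blueprint workflow) before the SELECT grant will succeed.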

🔧 Support for ACID transactions across multiple tables in S3
- Allows concurrent row-level modifications without writers overwriting each other
- 💡 ACID transactions traditionally required a SQL database, but that trade-off no longer applies: ACID transactions are now available on data lakes too!