AWS Glue
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285225
- 🔧 Two main functions:
- Serverless cataloging: discovery & definition of schemas & tables within data
- Serverless ETL: Feature engineering on raw data, then load to target
- 🔧 Runs Apache Spark under the hood
Glue Crawlers & Data Catalogs
- Crawlers scan data source (e.g. S3), create schema → data catalog
- ‼️ Catalogs only store **schema/metadata**; the original data stays in S3!
- Once cataloged, unstructured data can be treated like structured (can be read by Athena, EMR…)
- 💡 AWS Glue acts as the “glue” between unstructured data in data store and services that analyze structured data
- ❗ Glue crawler extracts partitions based on how the S3 data is organized!
- ‼️ Think upfront how the data lake in S3 will be queried! Some structure in the unstructured data is desired!
- e.g. if querying primarily by device, store data as `device/yyyy/mm/dd` instead of `yyyy/mm/dd/device`
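The impact of the key layout can be sketched in plain Python (the bucket name and helper are hypothetical, not part of Glue): with device as the leading path segment, a crawler exposes it as the first partition column, so device-filtered queries prune to a single prefix.

```python
from datetime import date

def partition_key(device: str, d: date, prefix: str = "s3://my-data-lake") -> str:
    """Build an S3 key partitioned by device first (hypothetical bucket/helper).

    A Glue crawler would expose each path segment (device, year, month, day)
    as a partition column, so a query filtering on device only scans that
    device's prefix instead of the whole lake.
    """
    return f"{prefix}/{device}/{d.year:04d}/{d.month:02d}/{d.day:02d}/"

print(partition_key("sensor-42", date(2024, 5, 7)))
# s3://my-data-lake/sensor-42/2024/05/07/
```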
- Apache Hive integration: run SQL-like queries from EMR
- Glue Data Catalog ↔ Hive metastore
- 💰 Billing
- Per-second billing for crawlers
- Data Catalog free tier: first million objects stored and first million accesses per month are free
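Conceptually, a crawler scans sample records and infers a column → type mapping (the table schema) for the catalog. A toy pure-Python illustration of that idea (not the Glue API; the type names are Glue-style guesses):

```python
def infer_schema(records):
    """Toy schema inference, mimicking what a Glue crawler does for JSON in S3:
    scan records and map each field name to a column type."""
    type_names = {int: "bigint", float: "double", str: "string", bool: "boolean"}
    schema = {}
    for rec in records:
        for field, value in rec.items():
            t = type_names.get(type(value), "string")
            # A field seen with conflicting types is ambiguous
            # (Glue ETL can later resolve this, e.g. via ResolveChoice)
            if schema.get(field, t) != t:
                schema[field] = "ambiguous"
            else:
                schema[field] = t
    return schema

rows = [
    {"device": "sensor-1", "temp": 21.5},
    {"device": "sensor-2", "temp": 19, "ok": True},
]
print(infer_schema(rows))
# {'device': 'string', 'temp': 'ambiguous', 'ok': 'boolean'}
```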
Glue ETL
- Feature engineering on raw data, then load to target (S3, JDBC, even Glue Data Catalog)
- Raw data can come from S3, JDBC, or streaming sources (Kinesis or Kafka; streaming ETL runs on Apache Spark Structured Streaming)
- Run on serverless Apache Spark platform → automatic Python or Scala code generation
- Performance of the underlying Spark jobs scales with the number of DPUs (data processing units)
- Execution
- Can be on-demand
- Can be event-driven (Glue Scheduler or Glue Triggers)
- Job bookmarks → persist state from job run, prevent reprocessing of old data
- ❗ Only handles new rows, not updated rows!
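The bookmark idea can be sketched as a persisted high-water mark (a toy simulation, not Glue's actual implementation; the listing and timestamps are made up): reruns skip anything at or below the mark, which is exactly why updated-in-place old objects are missed.

```python
def run_job(objects, bookmark):
    """Process only objects newer than the persisted bookmark (high-water mark).

    `objects` maps object key -> last-modified timestamp (a stand-in for an
    S3 listing). Returns (processed keys, new bookmark). Mirrors the Glue
    behavior: new objects are picked up on the next run, but an old object
    rewritten with a timestamp at/below the bookmark would be skipped.
    """
    fresh = {k: ts for k, ts in objects.items() if ts > bookmark}
    processed = sorted(fresh)
    new_bookmark = max(objects.values(), default=bookmark)
    return processed, new_bookmark

listing = {"raw/a.json": 100, "raw/b.json": 200}
done1, bm = run_job(listing, bookmark=0)   # first run processes everything
listing["raw/c.json"] = 300                # a new object arrives
done2, bm = run_job(listing, bookmark=bm)  # rerun processes only the new object
print(done1, done2)
# ['raw/a.json', 'raw/b.json'] ['raw/c.json']
```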
- Failure/success notifications can be sent to EventBridge/CloudWatch
- DynamicFrame: collection of DynamicRecords (self-describing, have a schema)
- Transformations
- Bundled transformations (SQL-like): `DropFields`, `DropNullFields`, `Filter`, `Join`, `Map`, …
- ML transformations (e.g. `FindMatches ML` → identify duplicate/matching records)
- Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
- Apache Spark transformations (e.g. K-Means)
- ResolveChoice → deal with ambiguities
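A minimal pure-Python sketch of what `ResolveChoice` does for a column seen with mixed types (this simulates the behavior on plain dicts; the real transform operates on a DynamicFrame, and the action names below follow Glue's documented `cast`/`make_cols` options):

```python
def resolve_choice(records, field, action):
    """Toy version of resolving a field observed with mixed types.

    Two Glue-style actions:
      "cast:double" - coerce every value of the field to one type
      "make_cols"   - split the field into per-type columns (field_int, field_double)
    """
    out = []
    for rec in records:
        rec = dict(rec)  # don't mutate the caller's records
        if field in rec:
            v = rec[field]
            if action == "cast:double":
                rec[field] = float(v)
            elif action == "make_cols":
                suffix = "double" if isinstance(v, float) else "int"
                rec[f"{field}_{suffix}"] = rec.pop(field)
        out.append(rec)
    return out

mixed = [{"temp": 21}, {"temp": 19.5}]  # "temp" is ambiguous: int vs double
print(resolve_choice(mixed, "temp", "cast:double"))
# [{'temp': 21.0}, {'temp': 19.5}]
```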
- ETL scripts can update Data Catalog schemas & partitions if necessary
- Development endpoints (for external notebooks)
- Develop ETL scripts using a notebook, then create ETL job in Glue that runs the script
- Endpoint inside a VPC → allows access from Apache Zeppelin, SageMaker notebook…
- 💰 Billing
- Per-second billing for ETL jobs
- Development endpoints for developing ETL code charged by the minute