AWS Glue
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285225
- 🔧 Two main functions:
- Serverless cataloging: discovery & definition of schemas & tables within data
- Serverless ETL: Feature engineering on raw data, then load to target
- 🔧 Runs Apache Spark under the hood
Glue Crawlers & Data Catalogs
- Crawlers scan data source (e.g. S3), create schema → data catalog
- ‼️ Catalogs only store **schema/metadata**; the original data stays in S3!
- Once cataloged, unstructured data can be treated like structured (can be read by Athena, EMR…)
- 💡 AWS Glue acts as the “glue” between unstructured data in data store and services that analyze structured data
- ❗ Glue crawler extracts partitions based on how the S3 data is organized!
- ‼️ Think upfront how the data lake in S3 will be queried! Some structure in the unstructured data is desired!
- e.g. if querying primarily by device, store data as `device/yyyy/mm/dd` instead of `yyyy/mm/dd/device`
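The impact of the key layout can be sketched in plain Python (the bucket name and helper are hypothetical, not part of Glue): with device as the leading path segment, a crawler exposes it as the first partition column, so device-filtered queries prune to a single prefix.

```python
from datetime import date

def partition_key(device: str, d: date, prefix: str = "s3://my-data-lake") -> str:
    """Build an S3 key partitioned by device first (hypothetical bucket/helper).

    A Glue crawler would expose each path segment (device, year, month, day)
    as a partition column, so a query filtering on device only scans that
    device's prefix instead of the whole lake.
    """
    return f"{prefix}/{device}/{d.year:04d}/{d.month:02d}/{d.day:02d}/"

print(partition_key("sensor-42", date(2024, 5, 7)))
# s3://my-data-lake/sensor-42/2024/05/07/
```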
- Apache Hive integration: run SQL-like queries from EMR
- Glue Data Catalog ↔ Hive metastore
- 💰 Billing
- Per-second billing for crawlers
- Data Catalog free tier: first million objects stored and first million accesses per month are free
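Conceptually, a crawler scans sample records and infers a column → type mapping (the table schema) for the catalog. A toy pure-Python illustration of that idea (not the Glue API; the type names are Glue-style guesses):

```python
def infer_schema(records):
    """Toy schema inference, mimicking what a Glue crawler does for JSON in S3:
    scan records and map each field name to a column type."""
    type_names = {int: "bigint", float: "double", str: "string", bool: "boolean"}
    schema = {}
    for rec in records:
        for field, value in rec.items():
            t = type_names.get(type(value), "string")
            # A field seen with conflicting types is ambiguous
            # (Glue ETL can later resolve this, e.g. via ResolveChoice)
            if schema.get(field, t) != t:
                schema[field] = "ambiguous"
            else:
                schema[field] = t
    return schema

rows = [
    {"device": "sensor-1", "temp": 21.5},
    {"device": "sensor-2", "temp": 19, "ok": True},
]
print(infer_schema(rows))
# {'device': 'string', 'temp': 'ambiguous', 'ok': 'boolean'}
```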
Glue ETL
- Feature engineering on raw data, then load to target (S3, JDBC, even Glue Data Catalog)
- Raw data can come from S3, JDBC, or streaming sources (Kinesis or Kafka; streaming ETL runs on Apache Spark Structured Streaming)
- Run on serverless Apache Spark platform → automatic Python or Scala code generation
- Performance of the underlying Spark jobs scales with the number of DPUs (data processing units)
- Execution
- Can be on-demand
- Can be event-driven (Glue Scheduler or Glue Triggers)
- Job bookmarks → persist state from job run, prevent reprocessing of old data
- ❗ Only handles new rows, not updated rows!
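The bookmark idea can be sketched as a persisted high-water mark (a toy simulation, not Glue's actual implementation; the listing and timestamps are made up): reruns skip anything at or below the mark, which is exactly why updated-in-place old objects are missed.

```python
def run_job(objects, bookmark):
    """Process only objects newer than the persisted bookmark (high-water mark).

    `objects` maps object key -> last-modified timestamp (a stand-in for an
    S3 listing). Returns (processed keys, new bookmark). Mirrors the Glue
    behavior: new objects are picked up on the next run, but an old object
    rewritten with a timestamp at/below the bookmark would be skipped.
    """
    fresh = {k: ts for k, ts in objects.items() if ts > bookmark}
    processed = sorted(fresh)
    new_bookmark = max(objects.values(), default=bookmark)
    return processed, new_bookmark

listing = {"raw/a.json": 100, "raw/b.json": 200}
done1, bm = run_job(listing, bookmark=0)   # first run processes everything
listing["raw/c.json"] = 300                # a new object arrives
done2, bm = run_job(listing, bookmark=bm)  # rerun processes only the new object
print(done1, done2)
# ['raw/a.json', 'raw/b.json'] ['raw/c.json']
```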
- Failure/success notifications can be sent to EventBridge/CloudWatch
- DynamicFrame: collection of DynamicRecords (self-describing, have a schema)
- Transformations
- Bundled transformations (SQL-like): `DropFields`, `DropNullFields`, `Filter`, `Join`, `Map`, …
- ML transformations (e.g. `FindMatches ML` → identify duplicate/matching records)
- Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
- Apache Spark transformations (e.g. K-Means)
- ResolveChoice → deal with ambiguities
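A minimal pure-Python sketch of what `ResolveChoice` does for a column seen with mixed types (this simulates the behavior on plain dicts; the real transform operates on a DynamicFrame, and the action names below follow Glue's documented `cast`/`make_cols` options):

```python
def resolve_choice(records, field, action):
    """Toy version of resolving a field observed with mixed types.

    Two Glue-style actions:
      "cast:double" - coerce every value of the field to one type
      "make_cols"   - split the field into per-type columns (field_int, field_double)
    """
    out = []
    for rec in records:
        rec = dict(rec)  # don't mutate the caller's records
        if field in rec:
            v = rec[field]
            if action == "cast:double":
                rec[field] = float(v)
            elif action == "make_cols":
                suffix = "double" if isinstance(v, float) else "int"
                rec[f"{field}_{suffix}"] = rec.pop(field)
        out.append(rec)
    return out

mixed = [{"temp": 21}, {"temp": 19.5}]  # "temp" is ambiguous: int vs double
print(resolve_choice(mixed, "temp", "cast:double"))
# [{'temp': 21.0}, {'temp': 19.5}]
```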
- ETL scripts can update Data Catalog schemas & partitions if necessary
- Development endpoints (for external notebooks)
- Develop ETL scripts using a notebook, then create ETL job in Glue that runs the script
- Endpoint inside a VPC → allows access from Apache Zeppelin, SageMaker notebook…
- 💰 Billing
- Per-second billing for ETL jobs
- Development endpoints for developing ETL code charged by the minute