AWS Glue 101

Ref: https://learn.cantrill.io/courses/1820301/lectures/42197048

AWS Glue - Overview

🔧 Serverless ETL (Extract, Transform & Load) & Serverless data catalogs
- …vs AWS Data Pipeline
  - Can do ETL, but uses compute servers (it creates EC2 instances or Amazon EMR clusters to run the work)
  - 💡 In exam, choose Glue if you need a serverless, ad-hoc, or cost-effective ETL solution
Two main functionalities
1. Moves & transforms data between data source(s) & data destination(s)
2. Crawls data sources & generates data catalogs
Fully-managed by AWS
- Serverless, public, regionally resilient service
- Scales automatically
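
A minimal sketch of what "serverless ETL" looks like when defining a job with boto3: capacity is requested as a worker type and count, not as servers to manage. The job name, role ARN, and S3 script path are hypothetical placeholders, and the worker settings are illustrative assumptions, not recommendations.

```python
def build_job_request(name, role_arn, script_s3_path, workers=2):
    """Build keyword arguments for Glue's CreateJob API call."""
    return {
        "Name": name,
        "Role": role_arn,
        # "glueetl" selects a Spark ETL job; the script itself lives in S3
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        # Capacity is requested as workers, not as servers you manage
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
    }


def create_and_run_job(request):
    """Register the job and start one run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without it

    glue = boto3.client("glue")
    glue.create_job(**request)
    return glue.start_job_run(JobName=request["Name"])["JobRunId"]


# Hypothetical values, for illustration only
request = build_job_request(
    "example-etl-job",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "s3://example-bucket/scripts/etl.py",
)
```

Calling `create_and_run_job(request)` would register the job and start a run; it is left uncalled here because it needs real AWS credentials.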

🔧 Data Catalog = collection of metadata combined with data management & search tools
- AWS Glue data catalogs store persistent metadata about data sources in a region
❗ Glue provides one catalog per region & per account
- Helps avoid data silos → improves visibility of data across an account
  - Metadata can be browsed & brought into other systems via Glue's ETL features
Various data related products can use Glue data catalogs & ETL features:
- Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Lake Formation…
Data crawlers discover data
- Crawlers configured with credentials to read data sources
- Crawlers discover the data (tables, schemas) & store metadata in data catalog
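
The crawler setup above can be sketched with boto3 roughly as follows. The IAM role supplies the credentials for reading the source, and the database is where the discovered tables land in the catalog. The crawler name, role ARN, database name, and S3 path are hypothetical placeholders.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for Glue's CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes to read the source
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler(request):
    """Create the crawler and start it (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only
crawler_request = build_crawler_request(
    "example-crawler",
    "arn:aws:iam::123456789012:role/ExampleGlueRole",
    "example_db",
    "s3://example-bucket/raw/",
)
```

Once a crawler run finishes, the tables and schemas it discovered appear under `example_db` in the region's data catalog.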

Supported data sources:
- Data stores (e.g. S3)
- DBs: RDS, JDBC-compatible DBs (e.g. Redshift), DynamoDB…
- Streams (Kinesis Data Streams, Apache Kafka…)
Supported data targets:
- Data stores (e.g. S3)
- DBs: RDS, JDBC-compatible DBs
Data Catalogs can be accessed by Glue jobs or other systems, e.g. AWS Management Console in an AWS Organization
- Data publicized & made visible across the organization
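
As an illustration of other systems reading the catalog, here is a minimal sketch that lists the tables of one catalog database via boto3 and reduces them to name, location, and columns. The database name and the sample table are made up; the dictionary keys follow the shape of the `get_tables` response.

```python
def summarize_tables(table_list):
    """Map each table name to its location and (column name, type) pairs."""
    summary = {}
    for table in table_list:
        storage = table.get("StorageDescriptor", {})
        summary[table["Name"]] = {
            "location": storage.get("Location"),
            "columns": [(c["Name"], c["Type"]) for c in storage.get("Columns", [])],
        }
    return summary


def fetch_tables(database):
    """Read all table metadata for one database (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_tables runs without it

    glue = boto3.client("glue")
    tables = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        tables.extend(page["TableList"])
    return tables


# Made-up sample shaped like one entry of a get_tables "TableList"
sample_tables = [
    {
        "Name": "orders",
        "StorageDescriptor": {
            "Location": "s3://example-bucket/raw/orders/",
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
        },
    }
]
catalog_view = summarize_tables(sample_tables)
```

With credentials in place, `summarize_tables(fetch_tables("example_db"))` would give the same view for a real catalog database.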