AWS Glue 101

AWS Glue - High-level overview

🔧 Serverless ETL (Extract, Transform & Load)
- …vs AWS Data Pipeline
  - Data Pipeline can also do ETL, but it runs on compute resources it launches for you (EC2 instances or EMR clusters)
  - 💡 In the exam, choose Glue if you need a serverless, ad-hoc or cost-effective ETL solution
- At a high level, moves & transforms data between data source(s) & data destination(s)
- Another main functionality: crawls data sources & generates data catalogs
Data Sources:
- Stores (S3)
- DBs: RDS & other JDBC-compatible DBs (e.g. Redshift), DynamoDB…
- Streams (Kinesis Data Streams & Apache Kafka)
Data Targets:
- Stores (S3)
- DBs: RDS, JDBC compatible DBs

🔧 Data Catalog = collection of metadata combined with data management & search tools
- In AWS Glue → persistent metadata about data sources in a region
❗ Glue provides one catalog per region per account
- Helps avoid data silos → improves visibility of data across an account
  - Metadata can be browsed & brought into other systems via Glue's ETL features
Various data related products can use Glue data catalogs & ETL features:
- Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Lake Formation…
Data crawlers discover data
- Crawlers configured with credentials to read data sources
- Crawlers discover the data (tables, schemas) & store metadata in data catalog
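
A minimal sketch of the crawler flow above, using the boto3 Glue client calls `create_crawler` and `start_crawler`. The crawler name, role ARN, database name and S3 path are hypothetical placeholders; the IAM role is what gives the crawler the credentials to read the source. The `glue` argument is assumed to be a boto3 Glue client (or any object exposing the same two methods).

```python
def make_glue_client():
    """Create a boto3 Glue client (requires boto3 plus AWS credentials)."""
    import boto3  # imported lazily so the helpers below work without it
    return boto3.client("glue")


def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for glue.create_crawler with one S3 target."""
    return {
        "Name": name,                # hypothetical crawler name
        "Role": role_arn,            # IAM role the crawler assumes to read the data
        "DatabaseName": database,    # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue, request):
    """Register the crawler, then start a crawl that populates the catalog."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

Usage would look like `create_and_start_crawler(make_glue_client(), build_crawler_request(...))`; once the crawl finishes, the discovered tables and schemas appear in the named catalog database.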

Data Catalogs can be accessed by Glue jobs & by other systems or users in the organization, e.g. via the management console
- Data publicized & made visible across the organization
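
To illustrate how another system might read the catalog, here is a rough sketch that walks every database and table via the boto3 Glue paginators `get_databases` and `get_tables`. It assumes `glue` is a boto3 Glue client for the region of interest (remember: one catalog per region per account); the function name is made up for this example.

```python
def list_catalog_tables(glue):
    """Return {database: {table: [(column, type), ...]}} from the Data Catalog."""
    catalog = {}
    db_pages = glue.get_paginator("get_databases").paginate()
    for db_page in db_pages:
        for db in db_page["DatabaseList"]:
            tables = {}
            tbl_pages = glue.get_paginator("get_tables").paginate(
                DatabaseName=db["Name"]
            )
            for tbl_page in tbl_pages:
                for tbl in tbl_page["TableList"]:
                    # Column metadata lives under StorageDescriptor; it may be absent.
                    cols = tbl.get("StorageDescriptor", {}).get("Columns", [])
                    tables[tbl["Name"]] = [
                        (c["Name"], c.get("Type", "")) for c in cols
                    ]
            catalog[db["Name"]] = tables
    return catalog
```

Passing the client in as a parameter keeps the function easy to try against a stub before pointing it at a real account.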
Glue Job = ETL (Extract, Transform & Load) Job
- No need to manage compute used during data transform (serverless)
  - AWS allocates resources from a warm pool; you are billed only for the resources used
- Can be started manually or via events
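
Starting a job manually can be sketched with the boto3 Glue calls `start_job_run` and `get_job_run`. This assumes a job already exists under the given name (the name and arguments are hypothetical), that `glue` is a boto3 Glue client, and that the listed states are the ones to treat as final; simple polling is used here for illustration only.

```python
import time

# Run states treated as final in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name, arguments=None, poll_seconds=30,
                     sleep=time.sleep):
    """Start a Glue job run and poll until it reaches a terminal state.

    Returns (run_id, final_state).
    """
    response = glue.start_job_run(JobName=job_name, Arguments=arguments or {})
    run_id = response["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)  # no compute to manage; just wait for the run
```

For event-driven starts, the same `start_job_run` call would typically sit inside whatever handler reacts to the event instead of being run by hand.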
AWS Glue Architecture (including Data Catalogs)