Ref: https://learn.cantrill.io/courses/1820301/lectures/42197048
AWS Glue - Overview
- 🔧 Serverless ETL (Extract, Transform & Load) & Serverless data catalogs
- …vs AWS Data Pipeline
- Data Pipeline can also do ETL, but it runs on compute servers (specifically, it creates Amazon EMR clusters)
- 💡 In exam, choose Glue if you need a serverless, ad-hoc, or cost-effective ETL solution
 
 
- Two main functionalities
- Moves & transforms data between data source(s) & data destination(s)
- Crawls data sources & generates data catalogs
 
- Fully-managed by AWS
- Serverless, public, regionally resilient service
- Scales automatically
 
AWS Glue - Data Catalogs
- 🔧 Data Catalog = collection of metadata combined with data management & search tools
- AWS Glue data catalogs store persistent metadata about data sources in a region
 
- ❗ Glue provides one catalog per account per region
- Helps avoid data silos → improves visibility of data across an account
- Metadata can be browsed & brought into other systems via Glue's ETL features
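
A minimal sketch of the "one catalog per account per region" point, assuming the standard Glue ARN layout: the catalog ARN carries only a region and an account ID, so there is nothing else to tell two catalogs apart. The helper names, the account ID `123456789012` and the database/table names are placeholders for illustration.

```python
def glue_catalog_arn(region: str, account_id: str, partition: str = "aws") -> str:
    """Build the ARN of the Glue Data Catalog for one account in one region."""
    return f"arn:{partition}:glue:{region}:{account_id}:catalog"


def glue_table_arn(region: str, account_id: str, database: str, table: str,
                   partition: str = "aws") -> str:
    """Build the ARN of a table that lives inside that catalog."""
    return f"arn:{partition}:glue:{region}:{account_id}:table/{database}/{table}"


# Placeholder values, not real resources.
print(glue_catalog_arn("us-east-1", "123456789012"))
print(glue_catalog_arn("eu-west-1", "123456789012"))
print(glue_table_arn("us-east-1", "123456789012", "sales_db", "orders"))
```

The same account gets a different catalog ARN in each region, which is why metadata crawled in one region is not automatically visible in another.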
 
 
- Various data-related products can use Glue data catalogs & ETL features:
- Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Lake Formation…
 
- Data crawlers discover data
- Crawlers configured with credentials to read data sources
- Crawlers discover the data (tables, schemas) & store metadata in data catalog
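
The crawler setup above can be sketched as a request body, assuming an S3 data source. The function only builds a dictionary shaped like the parameters of boto3's `create_crawler` call (`Name`, `Role`, `DatabaseName`, `Targets`, optional `Schedule`) and does not contact AWS. The crawler name, role ARN, database and bucket path are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def build_crawler_request(name: str, role_arn: str, database: str,
                          s3_path: str,
                          schedule: Optional[str] = None) -> Dict[str, Any]:
    """Assemble create_crawler-style parameters for a single S3 target."""
    request: Dict[str, Any] = {
        "Name": name,
        "Role": role_arn,            # credentials the crawler uses to read the source
        "DatabaseName": database,    # catalog database that receives the metadata
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule  # e.g. "cron(0 12 * * ? *)"
    return request


# Hypothetical values for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
print(json.dumps(request, indent=2))

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("glue").create_crawler(**request)
```

Keeping the request as plain data makes it easy to review the role and target before anything is created in the account.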
 
AWS Glue - Architecture
Diagram: https://github.com/acantril/aws-sa-associate-saac03/blob/main/1600-SERVERLESS_and_APPLICATION_SERVICES/00_LEARNINGAIDS/Glue.png
- Supported data sources:
- Data stores (e.g. S3)
- DBs: RDS, DynamoDB, JDBC-compatible DBs (e.g. Redshift…)
- Streams (Kinesis Data Streams, Apache Kafka…)
 
- Supported data targets:
- Data stores (e.g. S3)
- DBs: RDS, JDBC-compatible DBs
 
- Data Catalogs can be accessed by Glue jobs or by other systems, e.g. via the AWS Management Console in an AWS Organization
- Catalog metadata is published & made visible across the organization
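
A rough sketch of how another system might read table metadata out of a Data Catalog, assuming the response shape of the Glue `GetTables` API (`TableList`, `StorageDescriptor.Columns`, `NextToken`). The function takes the client as an argument, so a real `boto3.client("glue")` could be passed in. `FakeGlue` below is a hypothetical stand-in with made-up tables so the example runs without AWS access.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

Column = Tuple[str, str]  # (column name, column type)


def iter_catalog_tables(glue_client: Any,
                        database: str) -> Iterator[Tuple[str, List[Column]]]:
    """Yield (table name, columns) for every table in a catalog database."""
    kwargs: Dict[str, Any] = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page.get("TableList", []):
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            yield table["Name"], [(c["Name"], c.get("Type", "")) for c in columns]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # request the next page of results


class FakeGlue:
    """Hypothetical stand-in for a Glue client, returning two canned pages."""

    _pages: Dict[Optional[str], Dict[str, Any]] = {
        None: {
            "TableList": [{
                "Name": "orders",
                "StorageDescriptor": {"Columns": [
                    {"Name": "order_id", "Type": "bigint"},
                    {"Name": "amount", "Type": "double"},
                ]},
            }],
            "NextToken": "page-2",
        },
        "page-2": {
            "TableList": [{
                "Name": "customers",
                "StorageDescriptor": {"Columns": [
                    {"Name": "customer_id", "Type": "bigint"},
                ]},
            }],
        },
    }

    def get_tables(self, DatabaseName: str,
                   NextToken: Optional[str] = None) -> Dict[str, Any]:
        return self._pages[NextToken]


for name, cols in iter_catalog_tables(FakeGlue(), "sales_db"):
    print(name, cols)
```

Passing the client in, rather than creating it inside the function, keeps the pagination logic testable without credentials.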