Ref: https://learn.cantrill.io/courses/1820301/lectures/42197048
AWS Glue - Overview
- 🔧 Serverless ETL (Extract, Transform & Load) & serverless data catalogs
- …vs AWS Data Pipeline
- Can do ETL, but uses compute servers (specifically, creates Amazon EMR clusters)
- 💡 In the exam, choose Glue if you need a serverless, ad-hoc, or cost-effective ETL solution
- Two main functionalities
- Moves & transforms data between data source(s) & data destination(s)
- Crawls data sources & generates data catalogs
- Fully-managed by AWS
- Serverless, public, regionally resilient service
- Scales automatically
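
A minimal sketch of what "serverless ETL" looks like from the API side: you describe a job (script location, worker type, worker count) and Glue provisions the compute. It assumes the boto3 SDK; the job name, role ARN, script path, Glue version and worker settings are hypothetical placeholders, not values from the course. The request is built as a plain dict so it can be inspected without AWS credentials.

```python
def build_job_request(name, role_arn, script_s3_path, workers=2):
    """Build keyword arguments for the Glue CreateJob API (boto3: create_job)."""
    return {
        "Name": name,
        "Role": role_arn,                     # IAM role the job assumes
        "Command": {
            "Name": "glueetl",                # Spark-based ETL job type
            "ScriptLocation": script_s3_path,  # ETL script stored in S3
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",                 # assumed version, adjust as needed
        "WorkerType": "G.1X",                 # assumed worker size
        "NumberOfWorkers": workers,
    }


def create_job(request, region="us-east-1"):
    """Send the request to Glue. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the sketch loads without the SDK

    glue = boto3.client("glue", region_name=region)
    return glue.create_job(**request)


# Hypothetical values, for illustration only.
request = build_job_request(
    name="nightly-orders-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_s3_path="s3://example-bucket/scripts/orders_etl.py",
)
```

Calling `create_job(request)` would register the job. No servers or clusters appear anywhere in the request, which is the practical difference from the Data Pipeline approach noted above.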
AWS Glue - Data Catalogs
- 🔧 Data Catalog = collection of metadata combined with data management & search tools
- AWS Glue data catalogs store persistent metadata about data sources in a region
- ❗ Glue provides one catalog per region & per account
- Helps avoid data silos → improves visibility of data across an account
- Metadata can be browsed & brought into other systems via Glue's ETL features
- Various data related products can use Glue data catalogs & ETL features:
- Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Lake Formation…
- Data crawlers discover data
- Crawlers configured with credentials to read data sources
- Crawlers discover the data (tables, schemas) & store metadata in data catalog
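
The crawler notes above can be sketched as a request to the Glue CreateCrawler API. This is an illustration under assumptions, not the course's own example: it uses boto3, and the crawler name, role ARN, database name and S3 path are hypothetical. The IAM role stands in for the "credentials to read data sources", and `DatabaseName` is the catalog database where discovered tables are stored.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler API (boto3: create_crawler)."""
    return {
        "Name": name,
        "Role": role_arn,            # role that grants read access to the source
        "DatabaseName": database,    # catalog database that receives the metadata
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request, region="us-east-1"):
    """Create the crawler and run it once. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without the SDK

    glue = boto3.client("glue", region_name=region)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
crawler_request = build_crawler_request(
    name="orders-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/orders/",
)
```

After a run, the tables and schemas the crawler inferred appear in the `sales_db` database of that region's Data Catalog.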
AWS Glue - Architecture

- Supported data sources:
- Data stores (e.g. S3)
- DBs: RDS, JDBC-compatible DBs (e.g. Redshift), DynamoDB…
- Streams (Kinesis Data Streams, Apache Kafka…)
- Supported data targets:
- Data stores (e.g. S3)
- DBs: RDS, JDBC-compatible DBs
- Data Catalogs can be accessed by Glue jobs or other systems, e.g. AWS Management Console in an AWS Organization
- Data is published & made visible across the organization
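
As a rough illustration of how "other systems" might read the Data Catalog, the sketch below summarizes a response shaped like the Glue GetTables API output (`TableList`, each table with a `StorageDescriptor`). It assumes boto3 for the real call; the database, table and column names in the sample are made up, and only the first page of results is fetched.

```python
def summarize_tables(response):
    """Map each table name to its storage location and (column, type) pairs."""
    summary = {}
    for table in response.get("TableList", []):
        descriptor = table.get("StorageDescriptor", {})
        summary[table["Name"]] = {
            "location": descriptor.get("Location"),
            "columns": [
                (col["Name"], col["Type"])
                for col in descriptor.get("Columns", [])
            ],
        }
    return summary


def fetch_tables(database, region="us-east-1"):
    """Fetch the first page of tables from the catalog. Needs boto3 and credentials."""
    import boto3  # imported lazily so the sketch loads without the SDK

    glue = boto3.client("glue", region_name=region)
    return glue.get_tables(DatabaseName=database)


# Hand-written sample in the shape of a GetTables response (hypothetical data).
sample_response = {
    "TableList": [
        {
            "Name": "orders",
            "StorageDescriptor": {
                "Location": "s3://example-bucket/orders/",
                "Columns": [
                    {"Name": "order_id", "Type": "bigint"},
                    {"Name": "total", "Type": "double"},
                ],
            },
        }
    ]
}

summary = summarize_tables(sample_response)
```

With real credentials, `summarize_tables(fetch_tables("sales_db"))` would produce the same structure from the live catalog.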