Ref: https://learn.cantrill.io/courses/1820301/lectures/42197048
AWS Glue - High-level overview
- 🔧 Serverless ETL (Extract, Transform & Load)
- …vs AWS Data Pipeline
- Data Pipeline can also do ETL, but it runs on compute servers (specifically, it creates EMR clusters)
- 💡 In the exam, choose Glue if you need a serverless, ad-hoc or cost-effective ETL solution
- At a high level, moves & transforms data between data source(s) & data destination(s)
- Another main functionality: crawls data sources & generates data catalogs
- Data Sources:
- Stores (S3)
- DBs: RDS, JDBC-compatible DBs (Redshift…), DynamoDB (not JDBC; Glue reads it through its own DynamoDB connection type)
- Streams (Kinesis Data Streams & Apache Kafka)
- Data Targets:
- Stores (S3)
- DBs: RDS, JDBC compatible DBs
AWS Glue Data Catalogs
- 🔧 Data Catalog = collection of metadata combined with data management & search tools
- In AWS Glue → persistent metadata about data sources in a region
- ❗ Glue provides one catalog per region per account
- Helps avoid data silos → improves visibility of data across an account
- Metadata can be browsed & brought into other systems via Glue's ETL features
- Various data-related products can use Glue data catalogs & ETL features:
- Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Lake Formation…
- Data crawlers discover data
- Crawlers configured with credentials to read data sources
- Crawlers discover the data (tables, schemas) & store metadata in data catalog
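The crawler flow above can be sketched with boto3. This is a minimal sketch under stated assumptions, not a definitive setup: the helper names are made up for illustration, and the crawler name, role ARN, database name and S3 path are placeholders you would supply. The Glue client is passed in as `glue` so the helpers can be exercised without AWS credentials. `create_crawler`, `start_crawler` and `get_tables` are the boto3 Glue client calls being wrapped.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the boto3 Glue client's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role (credentials) the crawler uses to read the source
        "DatabaseName": database,  # catalog database that receives the discovered tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler(glue, request):
    """Create the crawler, then start it. The crawl itself runs asynchronously."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


def list_table_schemas(glue, database):
    """Map each table in a catalog database to its (column name, type) pairs."""
    schemas = {}
    kwargs = {"DatabaseName": database}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page.get("TableList", []):
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [(c["Name"], c.get("Type")) for c in columns]
        token = page.get("NextToken")
        if not token:  # no further pages
            return schemas
        kwargs["NextToken"] = token
```

With real credentials, `glue` would be something like `boto3.client("glue", region_name=...)`. The region matters because the catalog is per region per account.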
AWS Glue - Architecture
- Data Catalogs can be accessed by Glue jobs or by other systems, e.g. browsed through the management console by people in the organization
- Data publicized & made visible across the organization
- Glue Job = ETL (Extract, Transform & Load) Job
- No need to manage compute used during data transform (serverless)
- AWS allocates resources from a warm pool, only billed for used resources
- Can be started manually or via events
- AWS Glue Architecture (including Data Catalogs)
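Starting a job manually or from an event can be sketched as follows. This is one possible wiring, assumed for illustration: a Lambda-style handler reacting to an S3 event notification (Glue also has its own triggers, which are not shown). The helper names and the `--input_path` job argument are hypothetical. `start_job_run` is the boto3 Glue client call, and the client is injected as `glue` so the sketch runs without AWS credentials.

```python
from urllib.parse import unquote_plus


def start_job(glue, job_name, arguments=None):
    """Start one run of an existing Glue job and return its run id."""
    kwargs = {"JobName": job_name}
    if arguments:
        kwargs["Arguments"] = arguments  # job parameters as a str -> str mapping
    return glue.start_job_run(**kwargs)["JobRunId"]


def make_s3_event_handler(glue, job_name):
    """Build a Lambda-style handler that starts the job once per S3 object in the event."""
    def handler(event, context=None):
        run_ids = []
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            # "--input_path" is a hypothetical argument the job script would read.
            args = {"--input_path": f"s3://{bucket}/{key}"}
            run_ids.append(start_job(glue, job_name, args))
        return run_ids
    return handler
```

A manual start is just `start_job(glue, "my-job")` with a real client. In both cases no cluster is provisioned by the caller, which matches the serverless model described above.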