A Guide to AWS Glue: Data Catalog, Databases, Crawlers, and Triggers with S3
In the world of data processing and ETL (Extract, Transform, Load), AWS Glue stands out as a robust service. In this guide, we will explore various aspects of AWS Glue, including the AWS Glue Data Catalog, databases, tables, partitions, crawlers, connections, jobs, triggers, and endpoints. We’ll walk through the process of setting up an S3 bucket, creating an IAM role, and demonstrate how to use these AWS Glue features effectively.
Introduction
Let’s start by briefly introducing the key concepts we’ll cover:
- AWS Glue Data Catalog: A centralized metadata repository that stores metadata about data sources, transformations, and targets.
- AWS Glue Database: A logical container that organizes tables, allowing for better data management.
- AWS Glue Tables: The structure that represents data in the AWS Glue Data Catalog.
- Partition in AWS: A way to organize data within a table based on the values of one or more columns.
- AWS Glue Crawlers: Tools that scan various data stores, extract metadata, and create table definitions.
- AWS Glue Connection: A resource that contains the properties needed to connect to your source or target data store.
- AWS Glue Jobs: An ETL process that extracts data from the source, transforms it, and loads it into the target.
- AWS Glue Triggers: Events or conditions that can automatically invoke AWS Glue workflows.
- AWS Glue Endpoints: URLs that allow external systems to call AWS Glue API operations.
Setting Up S3 Bucket
Before diving into AWS Glue, we need to set up an S3 bucket and organize our data:
- Create a Bucket: Choose a unique name for your S3 bucket.
- Create Subfolders: Create subfolders in your S3 bucket to organize data efficiently. For example:
  - data/customer_database (acting as a database folder)
    - customer_csv (acting as a table)
      - dataload=20231031 (an AWS partition; upload CSV files here)
  - scripts (for saving Glue job scripts)
  - temp_dir (for temporary processing)
  - athena_results (to store Athena query responses)
The main folder hierarchy looks like this:
- data/customer_database/customer_csv/dataload=20231031/ (upload CSV files here)
- scripts/
- temp_dir/
- athena_results/
The data folder is the primary data source. The customer_database folder serves as a database folder, and customer_csv acts as a table. The dataload=20231031 folder represents an AWS partition, and the CSV file uploaded into it defines the schema for the table.
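If you prefer to script this setup, here is a minimal boto3 sketch that creates the bucket and folder structure described above. The bucket name my-glue-demo-bucket and the customers.csv file are hypothetical placeholders, not names from AWS.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; S3 bucket names must be globally unique.
bucket = "my-glue-demo-bucket"
s3.create_bucket(Bucket=bucket)  # us-east-1; other regions need CreateBucketConfiguration

# S3 has no real folders; zero-byte keys ending in "/" make the
# prefixes visible as folders in the console.
prefixes = [
    "data/customer_database/customer_csv/dataload=20231031/",
    "scripts/",
    "temp_dir/",
    "athena_results/",
]
for prefix in prefixes:
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload a sample CSV (hypothetical local file) into the partition folder.
s3.upload_file(
    "customers.csv",
    bucket,
    "data/customer_database/customer_csv/dataload=20231031/customers.csv",
)
```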
IAM Role
To create and manage jobs and access resources, create an IAM role with the necessary permissions for AWS Glue.
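For reference, here is a hedged boto3 sketch of creating such a role. The role name GlueDemoRole is a placeholder, and the broad S3 policy is for demo convenience only; scope permissions down in production.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the AWS Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Hypothetical role name.
iam.create_role(
    RoleName="GlueDemoRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AWSGlueServiceRole is the AWS-managed baseline policy for Glue.
iam.attach_role_policy(
    RoleName="GlueDemoRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
# Broad S3 access for the demo; restrict to your bucket in real use.
iam.attach_role_policy(
    RoleName="GlueDemoRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)
```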
AWS Glue Data Catalog
Now, let’s start using AWS Glue by creating a new database in the AWS Glue Data Catalog.
Create a New Database:
- In the AWS Glue Console, navigate to “Databases” under the Data Catalog section.
- Click “Add Database.”
- Add the S3 folder path of the customer_database folder as the location.
- Enter a database name and click “Create.”
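The same database can also be created programmatically. A minimal sketch, assuming the hypothetical bucket name from the earlier setup:

```python
import boto3

glue = boto3.client("glue")

glue.create_database(
    DatabaseInput={
        "Name": "customer_database",
        "Description": "Demo database backed by the customer_database S3 folder",
        # Hypothetical bucket name from the earlier S3 setup.
        "LocationUri": "s3://my-glue-demo-bucket/data/customer_database/",
    }
)
```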
With the database created, it’s time to add tables.
Load Data from S3 to Database:
You can create tables manually or use a crawler to load a CSV file as a table. Here’s how to use a crawler:
- Under the database, navigate to “Tables” and click “Add table using crawler.”
- Add a crawler name.
- Choose an existing data store if your data is already mapped to a Glue table. Otherwise, create a new data store.
- Specify the S3 path, in this case the customer_csv folder created earlier, and click “Add an S3 data source.”
- Configure the crawler to use the IAM role created earlier.
- Choose the database name.
- Review the details and create the crawler.
- Run the crawler to load the CSV files from the S3 path and create a table.
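The equivalent crawler setup in boto3 might look like the following sketch; the crawler name, role name, and S3 path are the hypothetical placeholders used earlier.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="customer-csv-crawler",   # hypothetical crawler name
    Role="GlueDemoRole",           # IAM role created earlier
    DatabaseName="customer_database",
    Targets={
        "S3Targets": [
            {"Path": "s3://my-glue-demo-bucket/data/customer_database/customer_csv/"}
        ]
    },
)

# Run it; the crawler infers the schema from the CSV files and
# registers customer_csv as a table with a dataload partition column.
glue.start_crawler(Name="customer-csv-crawler")
```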
To verify, go to the “Tables” tab under the database. You’ll find the newly created table listed there. Click on a table to view its details. You can also view the data in Athena if needed.
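The Athena check can be scripted too. A minimal sketch that writes query results to the athena_results folder created earlier; the table and database names assume the crawler output above.

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM customer_csv LIMIT 10",
    QueryExecutionContext={"Database": "customer_database"},
    ResultConfiguration={
        # Hypothetical bucket; the athena_results folder from the setup.
        "OutputLocation": "s3://my-glue-demo-bucket/athena_results/"
    },
)
print(response["QueryExecutionId"])
```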
AWS Glue Connection
If you want to load this data into another database, you can set up a database connection. Create a connection, give it a name, and add the database details (such as the JDBC URL and credentials); the resulting connection object can then be referenced by crawlers and jobs.
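A minimal boto3 sketch of a JDBC connection; every value here is a hypothetical placeholder for your own database.

```python
import boto3

glue = boto3.client("glue")

# All names, hosts, and credentials below are placeholders.
glue.create_connection(
    ConnectionInput={
        "Name": "customer-db-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://my-db-host:5432/customers",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-with-a-secret",  # prefer AWS Secrets Manager in practice
        },
    }
)
```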
AWS Glue Job
AWS Glue jobs are pivotal components of data processing and transformation. They serve as the core ETL (Extract, Transform, Load) mechanisms in AWS Glue, allowing you to seamlessly move data from a source to a target. In our scenario, both the source and target are S3 folders, exposed as input and output tables through AWS Glue crawlers.
Steps to Create an AWS Glue Job:
- Access AWS Glue Console: Begin by navigating to the AWS Glue Console.
- Create a Visual Job: Choose the option to create a visual job, which offers an intuitive interface for job design.
- Start with a Blank Graph: Begin the job from an empty canvas.
- Configure Source and Target (S3): Define your source (S3) and target (S3) nodes within the job, and specify transformation steps to manipulate your data as needed.
- Review and Run: Once you’ve configured your job, review the settings. Save the job and run it. The job execution might take some time, but upon completion, you’ll find the processed data in your target S3 path.
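Under the hood, the visual editor generates a PySpark script. A minimal hand-written sketch of such a job, assuming the customer_csv table from earlier; the column names id and name and the output path are hypothetical.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: the customer_csv table the crawler created.
source = glue_context.create_dynamic_frame.from_catalog(
    database="customer_database",
    table_name="customer_csv",
)

# Example transform: keep and rename a couple of (hypothetical) columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "customer_id", "string"),
              ("name", "string", "customer_name", "string")],
)

# Target: an output folder in the same bucket (hypothetical path).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-glue-demo-bucket/data/customer_database/customer_output/"},
    format="csv",
)

job.commit()
```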
Now, we’re in a position where we’ve processed data and have an output file in the target S3 location. But we’re not done yet. Let’s proceed to load this output into a database as a table.
Loading Output into a Database Table:
To achieve this, repeat the process by creating another AWS Glue crawler. This time, specify the S3 path you want to load and create a new table.
Triggering AWS Glue Jobs:
To automate your ETL workflow, you can set up triggers for AWS Glue jobs. These triggers can be time-bound (scheduled) or event-based. For testing purposes, you can create another AWS Glue job and run it upon the completion of the first job. This chaining of jobs demonstrates how seamlessly AWS Glue can handle data workflows.
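Such a conditional trigger can also be scripted. A sketch with two hypothetical job names, firing the second job whenever the first one succeeds:

```python
import boto3

glue = boto3.client("glue")

# Both job names are hypothetical placeholders.
glue.create_trigger(
    Name="run-second-after-first",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "ANY",
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "first-job",
            "State": "SUCCEEDED",
        }],
    },
    Actions=[{"JobName": "second-job"}],
    StartOnCreation=True,  # activate the trigger immediately
)
```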
In summary, this guide takes you through the essential aspects of AWS Glue, from setting up your S3 bucket and organizing data to creating databases and tables, and using crawlers to automate data ingestion. We’ve discussed the significance of AWS Glue jobs in ETL processes and the ways you can trigger these jobs for data automation. AWS Glue simplifies data processing and transformation tasks, making your data workflows efficient and manageable.
Feel free to contact me here on LinkedIn, follow me on Instagram, or leave a message on WhatsApp (+923225847078) in case of any queries.
Happy learning!