Apache Iceberg is a high-performance table format designed for managing large, evolving datasets, providing features such as schema evolution and time travel. The AWS Glue Catalog, on the other hand, is a fully managed metadata catalog that simplifies data discovery and schema management for data lakes. Chalk provides functionality to query Iceberg-formatted data stored in an AWS Glue Catalog using the `scan_iceberg` function.

The `scan_iceberg` function queries Iceberg-formatted data using metadata from a Hive-like catalog. It takes the following arguments:
```python
CustomPartitionOperation = Literal["date_trunc(day)"]

def scan_iceberg(
    target: str,
    catalog: BaseCatalog,
    columns: Sequence[str],
    custom_partitions: Mapping[str, tuple[CustomPartitionOperation, str]] = {},
): ...
```
- `target`: The name of the Iceberg table to read from, formatted like `database_name.table_name`.
- `catalog`: The catalog to use for metadata. Must be an instance of `BaseCatalog`, e.g. a `GlueCatalog`.
- `columns`: A sequence of column names to read from the table.
- `custom_partitions`: Optional. A mapping from partition column names to `(operation, source_column)` pairs used for partition pruning.

This function is intended for use in resolvers marked with `static=True`, which indicates to the Chalk query planner that the resolver can be executed at query planning time, rather than query execution time.
Here’s a simple example of how to use the `scan_iceberg` function to read data from an Iceberg table stored in an AWS Glue Catalog:
```python
from chalk import offline
from chalk.features import DataFrame
from chalk.integrations import GlueCatalog
from chalk.operators import scan_iceberg

# Instantiate the GlueCatalog with AWS credentials and configuration.
glue_catalog = GlueCatalog(
    name="aws_glue_catalog",
    aws_region="us-west-2",
    aws_role_arn="arn:aws:iam::123456789012:role/YourCatalogueAccessRole",
)

# Define a static resolver that reads data using the offline decorator.
@offline(static=True)
def read_data() -> DataFrame[Transaction.id, Transaction.amount, Transaction.timestamp]:
    return scan_iceberg(
        target="banking.transactions",
        catalog=glue_catalog,
        columns=("id", "amount", "timestamp"),
        custom_partitions={"transaction_date": ("date_trunc(day)", "transaction_timestamp")},
    )
```
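The example above assumes a `Transaction` feature class is defined elsewhere in your project. A minimal sketch of what that definition might look like (your actual feature class may carry more fields and annotations):

```python
from datetime import datetime

from chalk.features import features

# Hypothetical feature class assumed by the resolver above.
@features
class Transaction:
    id: int
    amount: float
    timestamp: datetime
```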
Note that the column names in the `columns` argument must match the column names in the Iceberg table schema. The `target` argument specifies the Iceberg table to read from, and the `catalog` argument specifies the catalog to use for metadata.
The `scan_iceberg` function supports pushdown filters and projections to optimize query performance. Chalk’s query planner will automatically push down filters and projections to the underlying Iceberg table when possible. This means that not all columns need to be read from the table, and filters can be applied to reduce the amount of data read.

Iceberg queries with filters applied will use the Iceberg partitioning functions stored in your Iceberg table’s metadata to scan only the partitions that contain relevant data for your filter. For example, when scanning an Iceberg table with a filter like `event_timestamp > '2024-06-01 10:35:00'`, if the table is partitioned by the column transform `event_date = day(event_timestamp)`, then Iceberg will only scan the partitions where `event_date >= '2024-06-01'`.
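To make the pruning step concrete, here is a small illustrative sketch (plain Python, not the Chalk or Iceberg API) of how a `day()` partition transform widens a timestamp predicate into a partition predicate:

```python
from datetime import datetime

# The query's filter: event_timestamp > 2024-06-01 10:35:00.
filter_bound = datetime(2024, 6, 1, 10, 35, 0)

# day() truncates the bound to its containing day, so any partition with
# event_date >= 2024-06-01 may hold matching rows; all earlier partitions
# can be skipped without reading their data files.
partition_bound = filter_bound.replace(hour=0, minute=0, second=0, microsecond=0)

def partition_may_match(event_date: datetime) -> bool:
    return event_date >= partition_bound

assert partition_may_match(datetime(2024, 6, 1))       # scanned
assert not partition_may_match(datetime(2024, 5, 31))  # pruned
```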
If your Iceberg table is partitioned, but the partition configuration is missing a relevant partition transform, you can use the `custom_partitions` parameter to benefit from partition filtering without needing to evolve your source table’s schema. For example, if your table was partitioned on `identity(event_date)` but you know that the `event_date` column is derived from the `event_timestamp` column:
```python
scan_iceberg(
    ...,
    custom_partitions={
        "event_date": ("date_trunc(day)", "event_timestamp"),
    },
)
```
This has the same effect as adding an `event_date = day(event_timestamp)` partition transform to your Iceberg table’s metadata. If the target Iceberg table already has a partition spec with e.g. `day(event_timestamp)`, then the `custom_partitions` argument is not necessary.
To successfully query Iceberg data through AWS Glue, ensure that the IAM role or user used in your AWS credentials has the following permissions.

These permissions allow access to the AWS Glue metadata:

- `glue:GetDatabase`
- `glue:GetTable`
- `glue:GetPartition`
- `glue:GetTableVersion`
- `glue:GetTableVersions`

These permissions allow reading of the actual data stored in your data lake (e.g., in Amazon S3):

- `s3:GetObject`
- `s3:ListBucket`

Properly configuring these permissions is crucial to ensure that your queries can access both the Glue catalog metadata and the underlying data without encountering authorization issues.
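As a reference point, a minimal policy granting these permissions might look like the following sketch, expressed here as a Python dict mirroring the JSON policy document; the resource ARNs are placeholders you would scope to your own Glue catalog and S3 bucket:

```python
# Sketch of an IAM policy with the permissions listed above. Resource ARNs
# are placeholders; scope them to your Glue databases/tables and S3 bucket.
ICEBERG_READ_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:GetPartition",
                "glue:GetTableVersion",
                "glue:GetTableVersions",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::your-data-lake-bucket",
                "arn:aws:s3:::your-data-lake-bucket/*",
            ],
        },
    ],
}
```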
To add Chalk datasets to Glue, you can use the `Dataset.write_to` method.
```python
from chalk.integrations import GlueCatalog

catalog = GlueCatalog(
    name="aws_glue_catalog",
    aws_region="us-west-2",
    catalog_id="123",
    aws_role_arn="arn:aws:iam::123456789012:role/YourCatalogueAccessRole",
)

dataset.write_to(destination="database.table_name", catalog=catalog)
```
This will create a table referencing the dataset in the specified location in the Glue catalog, making it available for querying with tools like AWS Athena.
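The `dataset` in the snippet above would typically be produced by an offline query. A sketch of the end-to-end flow, assuming the `Transaction` feature class from earlier (the input values and `dataset_name` are illustrative):

```python
from chalk.client import ChalkClient

client = ChalkClient()

# Materialize features into a named Dataset, then register it in Glue.
dataset = client.offline_query(
    input={Transaction.id: [1, 2, 3]},
    output=[Transaction.amount, Transaction.timestamp],
    dataset_name="transactions_sample",  # illustrative name
)
dataset.write_to(destination="database.table_name", catalog=catalog)
```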
To write to a Glue catalog, the IAM role or user used in your AWS credentials must have the following permission:

- `glue:CreateTable`

Note: this ‘create table’ operation runs from your client, not from the Chalk platform.