Apache Iceberg is a high-performance table format designed for managing large, evolving datasets, providing features such as schema evolution and time travel. The AWS Glue Catalog, on the other hand, is a fully managed metadata catalog that simplifies data discovery and schema management for data lakes. Chalk provides functionality to query Iceberg-formatted data stored in an AWS Glue Catalog using the `scan_iceberg` function.

The `scan_iceberg` function queries Iceberg-formatted data using metadata from a Hive-like catalog. It takes the following arguments:
```python
CustomPartitionOperation = Literal["date_trunc(day)"]

def scan_iceberg(
    target: str,
    catalog: BaseCatalog,
    columns: Sequence[str],
    custom_partitions: Mapping[str, tuple[CustomPartitionOperation, str]] = {},
): ...
```
- `target`: The name of the Iceberg table to read from, formatted like `database_name.table_name`.
- `catalog`: The catalog to use for metadata. Must be an instance of `BaseCatalog`, e.g. a `GlueCatalog`.
- `columns`: A sequence of column names to read from the table.
- `custom_partitions`: Optional. A mapping from partition column names to `(operation, source_column)` pairs used for partition pruning.

This function is intended for use in resolvers marked with `static=True`, which indicates to the Chalk query planner that the resolver can be executed at query planning time, rather than query execution time.
Here’s a simple example of how to use the `scan_iceberg` function to read data from an Iceberg table stored in an AWS Glue Catalog:
```python
from chalk import offline
from chalk.features import DataFrame
from chalk.integrations import GlueCatalog
from chalk.operators import scan_iceberg

# Instantiate the GlueCatalog with AWS credentials and configuration.
glue_catalog = GlueCatalog(
    name="aws_glue_catalog",
    aws_region="us-west-2",
    aws_role_arn="arn:aws:iam::123456789012:role/YourCatalogueAccessRole",
)

# Define a static resolver that reads data using the offline decorator.
@offline(static=True)
def read_data() -> DataFrame[Transaction.id, Transaction.amount, Transaction.timestamp]:
    return scan_iceberg(
        target="banking.transactions",
        catalog=glue_catalog,
        columns=("id", "amount", "timestamp"),
        custom_partitions={"transaction_date": ("date_trunc(day)", "transaction_timestamp")},
    )
```
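The example above assumes a `Transaction` feature class is defined elsewhere in your project. A minimal sketch of what that definition might look like (your actual feature class may carry more fields and annotations):

```python
from datetime import datetime

from chalk.features import features

# Hypothetical feature class assumed by the resolver above.
@features
class Transaction:
    id: int
    amount: float
    timestamp: datetime
```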
Note that the column names in the `columns` argument must match the column names in the Iceberg table schema. The `target` argument specifies the Iceberg table to read from, and the `catalog` argument specifies the catalog to use for metadata.
The `scan_iceberg` function supports pushdown filters and projections to optimize query performance. Chalk’s query planner will automatically push down filters and projections to the underlying Iceberg table when possible. This means that not all columns need to be read from the table, and filters can be applied to reduce the amount of data read.

Iceberg queries with filters applied will use the Iceberg partitioning functions stored in your Iceberg table’s metadata to scan only the partitions that contain relevant data for your filter. For example, when scanning an Iceberg table with a filter like `event_timestamp > '2024-06-01 10:35:00'`, if the table is partitioned by the column transform `event_date = day(event_timestamp)`, then Iceberg will only scan the partitions where `event_date >= '2024-06-01'`.
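To make the pruning step concrete, here is a small illustrative sketch (plain Python, not the Chalk or Iceberg API) of how a `day()` partition transform widens a timestamp predicate into a partition predicate:

```python
from datetime import datetime

# The query's filter: event_timestamp > 2024-06-01 10:35:00.
filter_bound = datetime(2024, 6, 1, 10, 35, 0)

# day() truncates the bound to its containing day, so any partition with
# event_date >= 2024-06-01 may hold matching rows; all earlier partitions
# can be skipped without reading their data files.
partition_bound = filter_bound.replace(hour=0, minute=0, second=0, microsecond=0)

def partition_may_match(event_date: datetime) -> bool:
    return event_date >= partition_bound

assert partition_may_match(datetime(2024, 6, 1))       # scanned
assert not partition_may_match(datetime(2024, 5, 31))  # pruned
```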
If your Iceberg table is partitioned, but the partition configuration is missing a relevant partition transform, you can use the `custom_partitions` parameter to benefit from partition filtering without needing to evolve your source table’s schema. For example, if your table was partitioned on `identity(event_date)` but you know that the `event_date` column is derived from the `event_timestamp` column:
```python
scan_iceberg(
    ...,
    custom_partitions={
        "event_date": ("date_trunc(day)", "event_timestamp"),
    },
)
```
This has the same effect as adding an `event_date = day(event_timestamp)` partition transform to your Iceberg table’s metadata. If the target Iceberg table already has a partition spec with e.g. `day(event_timestamp)`, then the `custom_partitions` argument is not necessary.
To successfully query Iceberg data through AWS Glue, ensure that the IAM role or user used in your AWS credentials has the following permissions.

These permissions allow access to the AWS Glue metadata:

- `glue:GetDatabase`
- `glue:GetTable`
- `glue:GetPartition`
- `glue:GetTableVersion`
- `glue:GetTableVersions`

These permissions allow reading of the actual data stored in your data lake (e.g., in Amazon S3):

- `s3:GetObject`
- `s3:ListBucket`

Properly configuring these permissions is crucial to ensure that your queries can access both the Glue catalog metadata and the underlying data without encountering authorization issues.
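As a reference point, a minimal policy granting these permissions might look like the following sketch, expressed here as a Python dict mirroring the JSON policy document; the resource ARNs are placeholders you would scope to your own Glue catalog and S3 bucket:

```python
# Sketch of an IAM policy with the permissions listed above. Resource ARNs
# are placeholders; scope them to your Glue databases/tables and S3 bucket.
ICEBERG_READ_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:GetPartition",
                "glue:GetTableVersion",
                "glue:GetTableVersions",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::your-data-lake-bucket",
                "arn:aws:s3:::your-data-lake-bucket/*",
            ],
        },
    ],
}
```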
To add Chalk datasets to Glue, you can use the `Dataset.write_to` method.
```python
from chalk.integrations import GlueCatalog

catalog = GlueCatalog(
    name="aws_glue_catalog",
    aws_region="us-west-2",
    catalog_id="123",
    aws_role_arn="arn:aws:iam::123456789012:role/YourCatalogueAccessRole",
)

dataset.write_to(destination="database.table_name", catalog=catalog)
```
This will create a table referencing the dataset in the specified location in the Glue catalog, making it available for querying with tools like AWS Athena.
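The `dataset` in the snippet above would typically be produced by an offline query. A sketch of the end-to-end flow, assuming the `Transaction` feature class from earlier (the input values and `dataset_name` are illustrative):

```python
from chalk.client import ChalkClient

client = ChalkClient()

# Materialize features into a named Dataset, then register it in Glue.
dataset = client.offline_query(
    input={Transaction.id: [1, 2, 3]},
    output=[Transaction.amount, Transaction.timestamp],
    dataset_name="transactions_sample",  # illustrative name
)
dataset.write_to(destination="database.table_name", catalog=catalog)
```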
To write to a Glue catalog, the IAM role or user used in your AWS credentials must have the following permission:

- `glue:CreateTable`

Note: this ‘create table’ operation runs from your client, not from the Chalk platform.