Chalk API - Chalk AI

DataFrame

Lightweight DataFrame wrapper around Chalk's execution engine.

The `DataFrame` class constructs query plans backed by libchalk and can materialize them into Arrow tables. It offers a minimal API similar to that of other DataFrame libraries and delegates the heavy lifting to the underlying engine.


DataFrame

Class

Logical representation of tabular data.

A `DataFrame` wraps a `libchalk.chalktable.ChalkTable` plan and a mapping of materialized Arrow tables. Operations construct new plans and return new [DataFrame](#DataFrame) instances, leaving previous ones untouched.
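The "new instance per operation" contract can be illustrated with a toy model. This sketch is not Chalk's implementation: `ToyFrame`, its `steps` tuple, and `collect` are hypothetical names, and plain Python rows stand in for the plan and the Arrow tables. It only shows how an operation can extend a plan and return a new object while the original stays unchanged.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a plan-backed frame; not the Chalk class."""

    source: tuple[Row, ...]  # stands in for the materialized tables
    steps: tuple[Callable[[list[Row]], list[Row]], ...] = ()  # stands in for the plan

    def filter(self, predicate: Callable[[Row], bool]) -> "ToyFrame":
        # Return a new frame whose plan has one extra step; self is not modified.
        step = lambda rows: [r for r in rows if predicate(r)]
        return ToyFrame(self.source, self.steps + (step,))

    def collect(self) -> list[Row]:
        # Nothing is computed until the plan is executed here.
        rows = [dict(r) for r in self.source]
        for step in self.steps:
            rows = step(rows)
        return rows


base = ToyFrame(source=({"id": 1, "amount": 5}, {"id": 2, "amount": 50}))
large = base.filter(lambda r: r["amount"] > 10)
```

In this toy, `base` can still be executed or extended in a different direction after `large` is derived from it, which is the property the class description refers to.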

Functions

DataFrame.__init__(root, tables)

Create a [DataFrame](#DataFrame) from a plan or materialized Arrow table.

:param root: Either a ChalkTable plan or an in-memory Arrow table.
:param tables: Mapping of additional table names to Arrow data.

DataFrame.named_table(name, schema)

Create a [DataFrame](#DataFrame) for a named table.

:param name: Table identifier.
:param schema: Arrow schema describing the table.
:return: DataFrame referencing the named table.

DataFrame.from_arrow(data)

Construct a [DataFrame](#DataFrame) from an in-memory Arrow object.

DataFrame.scan_parquet(input_uris, num_concurrent_downloads, ...+4)

Scan Parquet files and return a DataFrame.

:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:param num_concurrent_downloads: Number of concurrent downloads.
:param max_num_batches_to_buffer: Maximum number of batches to buffer.
:param target_batch_size_bytes: Target batch size in bytes.
:param observed_at_partition_key: Partition key for observed_at.
:return: DataFrame

DataFrame.scan(name, input_uris, ...+1)

Scan files and return a DataFrame. Currently, CSV (with headers) and Parquet are supported.

:param name: A name to call the table being scanned.
:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:return: DataFrame

DataFrame.table_scan_parquet(name, input_uris, ...+1)

Scan Parquet files and return a DataFrame.

:param name: A name to call the table being scanned.
:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:return: DataFrame

DataFrame.scan_glue_iceberg(glue_table_name, schema, ...+8)

Load data from an AWS Glue Iceberg table.

:param glue_table_name: Fully qualified database.table name.
:param schema: Mapping of column names to Arrow types.
:param batch_row_count: Number of rows per batch.
:param aws_catalog_account_id: AWS account hosting the Glue catalog.
:param aws_catalog_region: Region of the Glue catalog.
:param aws_role_arn: IAM role to assume for access.
:param filter_predicate: Optional filter applied during scan.
:param parquet_scan_range_column: Column used for range-based reads.
:param custom_partitions: Additional partition definitions.
:param partition_column: Column name representing partitions.
:return: DataFrame backed by the Glue table.

DataFrame.from_catalog_table(table_name, catalog)

Create a [DataFrame](#DataFrame) from a Chalk SQL catalog table.

DataFrame.from_datasource(source, query, ...+1)

Create a DataFrame from the result of querying a SQL source.

:param source: SQL source to query.
:param query: SQL query to execute.
:param expected_output_schema: Output schema of the query result. The datasource's driver is expected to convert the native query result to this schema.

DataFrame.explain_logical()

Return a string representation of the logical plan.

DataFrame.explain_physical()

Return a string representation of the physical plan.

DataFrame.get_plan()

Expose the underlying `ChalkTable` plan.

DataFrame.get_tables()

Return the mapping of materialized tables for this DataFrame.

DataFrame.with_columns(dict)

Add or replace columns based on a mapping of expressions.

DataFrame.with_unique_id(name)

Add a monotonically increasing unique identifier column.

DataFrame.filter(expr)

Filter rows according to the expression `expr`.

DataFrame.slice(start, length)

Return a subset of rows starting at `start`, with an optional `length`.

DataFrame.col(column)

Get a column expression from the DataFrame.

:param column: Column name.
:return: Column expression.

DataFrame.column(column)

Get a column expression from the DataFrame.

:param column: Column name.
:return: Column expression.

DataFrame.project(columns)

Project to the provided column expressions.

DataFrame.select(columns)

Select existing columns by name.

DataFrame.explode(column)

Explode a column in the DataFrame.

:param column: Column name to explode.
:return: DataFrame with exploded column.

DataFrame.join(other, on, ...+1)

Join this [DataFrame](#DataFrame) with another.

:param other: Right-hand [DataFrame](#DataFrame).
:param on: Column names or mapping of left->right join keys.
:param how: Join type (e.g. "inner" or "left").
:return: Resulting [DataFrame](#DataFrame) after the join.

DataFrame.agg(by, aggregations)

Group by the `by` columns and apply aggregation expressions.

DataFrame.order_by(columns)

Sort the [DataFrame](#DataFrame) by one or more columns.

DataFrame.rename(new_names)

Rename columns in the DataFrame.

:param new_names: Dictionary mapping old column names to new column names.
:return: DataFrame with renamed columns.

DataFrame.run(tables)

Execute the plan and yield resulting Arrow RecordBatches.
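Because `run` yields batches rather than returning a single table, callers can consume results incrementally. The generator below is a toy analogue of that pattern, not the Chalk API: `toy_run` and `batch_size` are hypothetical, and a list of dicts stands in for an Arrow RecordBatch.

```python
from typing import Any, Iterator

Row = dict[str, Any]


def toy_run(rows: list[Row], batch_size: int = 2) -> Iterator[list[Row]]:
    """Yield rows in fixed-size batches; each list stands in for a RecordBatch."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [{"id": i} for i in range(5)]
# Consume one batch at a time instead of holding the full result in memory.
batch_sizes = [len(batch) for batch in toy_run(rows)]
```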

Creating DataFrames

Class methods for constructing new DataFrame instances from various data sources.

These methods provide multiple ways to create DataFrames:

From Arrow tables in memory
From Parquet files on disk or cloud storage
From AWS Glue Iceberg tables
From SQL datasources
From Chalk SQL catalog tables

named_table(name, schema)

Create a [DataFrame](#DataFrame) for a named table.

:param name: Table identifier.
:param schema: Arrow schema describing the table.
:return: DataFrame referencing the named table.

from_arrow(data)

Construct a [DataFrame](#DataFrame) from an in-memory Arrow object.

scan_parquet(input_uris, num_concurrent_downloads, ...+4)

Scan Parquet files and return a DataFrame.

:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:param num_concurrent_downloads: Number of concurrent downloads.
:param max_num_batches_to_buffer: Maximum number of batches to buffer.
:param target_batch_size_bytes: Target batch size in bytes.
:param observed_at_partition_key: Partition key for observed_at.
:return: DataFrame

table_scan_parquet(name, input_uris, ...+1)

Scan Parquet files and return a DataFrame.

:param name: A name to call the table being scanned.
:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:return: DataFrame

scan_glue_iceberg(glue_table_name, schema, ...+8)

Load data from an AWS Glue Iceberg table.

:param glue_table_name: Fully qualified database.table name.
:param schema: Mapping of column names to Arrow types.
:param batch_row_count: Number of rows per batch.
:param aws_catalog_account_id: AWS account hosting the Glue catalog.
:param aws_catalog_region: Region of the Glue catalog.
:param aws_role_arn: IAM role to assume for access.
:param filter_predicate: Optional filter applied during scan.
:param parquet_scan_range_column: Column used for range-based reads.
:param custom_partitions: Additional partition definitions.
:param partition_column: Column name representing partitions.
:return: DataFrame backed by the Glue table.

from_catalog_table(table_name, catalog)

Create a [DataFrame](#DataFrame) from a Chalk SQL catalog table.

from_datasource(source, query, ...+1)

Create a DataFrame from the result of querying a SQL source.

:param source: SQL source to query.
:param query: SQL query to execute.
:param expected_output_schema: Output schema of the query result. The datasource's driver is expected to convert the native query result to this schema.

Column Operations

Methods for selecting, transforming, and manipulating columns.

These operations allow you to:

Select specific columns by name
Add or replace columns with new expressions
Rename columns
Project columns to new names
Get column expressions for use in filters and transformations
Add unique identifier columns
Explode array columns into multiple rows

col(column)

Get a column expression from the DataFrame.

:param column: Column name.
:return: Column expression.

column(column)

Get a column expression from the DataFrame.

:param column: Column name.
:return: Column expression.

select(columns)

Select existing columns by name.

with_columns(dict)

Add or replace columns based on a mapping of expressions.
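To make "add or replace" concrete, here is a minimal pure-Python sketch over rows stored as dicts. It is not the Chalk API: `toy_with_columns` is a hypothetical helper, plain callables stand in for Chalk expressions, and evaluating every expression against the input row (rather than against earlier outputs) is an assumption of this toy.

```python
from typing import Any, Callable, Mapping

Row = dict[str, Any]


def toy_with_columns(
    rows: list[Row], columns: Mapping[str, Callable[[Row], Any]]
) -> list[Row]:
    """Return new rows; a name that already exists is replaced, a new name is added."""
    # Each expression sees the original row, so the outputs do not depend on each other.
    return [{**row, **{name: fn(row) for name, fn in columns.items()}} for row in rows]


rows = [{"price": 10, "qty": 3}]
updated = toy_with_columns(
    rows,
    {
        "total": lambda r: r["price"] * r["qty"],  # new column
        "qty": lambda r: r["qty"] + 1,             # replaces the existing column
    },
)
```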

project(columns)

Project to the provided column expressions.

rename(new_names)

Rename columns in the DataFrame.

:param new_names: Dictionary mapping old column names to new column names.
:return: DataFrame with renamed columns.
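The direction of the `new_names` mapping (old name to new name) is easy to get backwards, so a small illustration may help. `toy_rename` is a hypothetical pure-Python helper over dict rows, not part of Chalk; leaving unmapped columns untouched is an assumption of the sketch.

```python
from typing import Any, Mapping

Row = dict[str, Any]


def toy_rename(rows: list[Row], new_names: Mapping[str, str]) -> list[Row]:
    """Rename keys using an old-name -> new-name mapping; other columns are kept as-is."""
    return [
        {new_names.get(col, col): value for col, value in row.items()}
        for row in rows
    ]


rows = [{"id": 1, "ts": "2024-01-01"}]
renamed = toy_rename(rows, {"ts": "observed_at"})
```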

with_unique_id(name)

Add a monotonically increasing unique identifier column.
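As a rough picture of what a monotonically increasing identifier column looks like, the sketch below numbers dict rows in order. `toy_with_unique_id` and its `start` argument are hypothetical; in particular, starting at 0 and producing consecutive values are assumptions of this toy, since the description only promises that identifiers are unique and increasing.

```python
from itertools import count
from typing import Any

Row = dict[str, Any]


def toy_with_unique_id(rows: list[Row], name: str, start: int = 0) -> list[Row]:
    """Attach an increasing integer under `name` to each row, in row order."""
    ids = count(start)
    return [{**row, name: next(ids)} for row in rows]


rows = [{"user": "a"}, {"user": "b"}, {"user": "c"}]
with_ids = toy_with_unique_id(rows, "row_id")
```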

explode(column)

Explode a column in the DataFrame.

:param column: Column name to explode.
:return: DataFrame with exploded column.
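Exploding turns one row with a list-valued column into one row per list element, repeating the other columns. The pure-Python sketch below shows the idea on dict rows. `toy_explode` is hypothetical, and dropping rows whose list is empty is an assumption of this toy rather than documented Chalk behavior.

```python
from typing import Any

Row = dict[str, Any]


def toy_explode(rows: list[Row], column: str) -> list[Row]:
    """Emit one output row per element of the list stored under `column`."""
    out: list[Row] = []
    for row in rows:
        for item in row[column]:
            # Copy the row and replace the list with a single element.
            out.append({**row, column: item})
    return out


rows = [{"id": 1, "tags": ["x", "y"]}, {"id": 2, "tags": []}]
exploded = toy_explode(rows, "tags")
```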

Row Operations

Methods for filtering and ordering rows.

These operations allow you to:

Filter rows based on conditions
Sort rows by one or more columns
Select a subset of rows by position

filter(expr)

Filter rows according to the expression `expr`.

order_by(columns)

Sort the [DataFrame](#DataFrame) by one or more columns.

slice(start, length)

Return a subset of rows starting at `start`, with an optional `length`.
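The three row operations compose naturally: filter, then sort, then take a positional window. The sketch below mimics that flow on plain lists of dicts. The `toy_*` helpers are hypothetical and not part of Chalk; ascending sort order and a non-negative `start` are assumptions of this toy.

```python
from typing import Any, Callable, Optional

Row = dict[str, Any]


def toy_filter(rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    return [r for r in rows if predicate(r)]


def toy_order_by(rows: list[Row], columns: list[str]) -> list[Row]:
    # Sort ascending by the listed columns, comparing left to right.
    return sorted(rows, key=lambda r: tuple(r[c] for c in columns))


def toy_slice(rows: list[Row], start: int, length: Optional[int] = None) -> list[Row]:
    # Without a length, return everything from `start` to the end.
    end = None if length is None else start + length
    return rows[start:end]


rows = [
    {"id": 3, "score": 0.2},
    {"id": 1, "score": 0.9},
    {"id": 2, "score": 0.7},
    {"id": 4, "score": 0.8},
]
kept = toy_filter(rows, lambda r: r["score"] > 0.5)
ranked = toy_order_by(kept, ["score"])
page = toy_slice(ranked, start=1, length=1)
rest = toy_slice(ranked, start=1)
```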

Joins and Aggregations

Methods for combining DataFrames and performing group-by operations.

Join operations combine two DataFrames based on matching keys. Aggregation operations group rows and compute summary statistics.

join(other, on, ...+1)

Join this [DataFrame](#DataFrame) with another.

:param other: Right-hand [DataFrame](#DataFrame).
:param on: Column names or mapping of left->right join keys.
:param how: Join type (e.g. "inner" or "left").
:return: Resulting [DataFrame](#DataFrame) after the join.
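The two accepted shapes of `on` (a list of shared column names, or a mapping of left key to right key) and the difference between "inner" and "left" can be sketched in pure Python. `toy_join` is a hypothetical helper over lists of dicts, not Chalk's implementation. The `"inner"` default, the handling of name collisions (right-hand values win), and taking the right-hand columns from the first right row are assumptions of this toy.

```python
from typing import Any, Mapping, Sequence, Union

Row = dict[str, Any]


def toy_join(
    left: list[Row],
    right: list[Row],
    on: Union[Sequence[str], Mapping[str, str]],
    how: str = "inner",
) -> list[Row]:
    """Illustrative hash join over lists of dicts."""
    # Normalize `on` into (left_key, right_key) pairs; pass a list, not a bare string.
    pairs = list(on.items()) if isinstance(on, Mapping) else [(k, k) for k in on]
    right_keys = {rk for _, rk in pairs}
    right_extra = [c for c in (right[0] if right else {}) if c not in right_keys]

    # Index right-hand rows by their key tuple.
    index: dict[tuple, list[Row]] = {}
    for rrow in right:
        index.setdefault(tuple(rrow[rk] for _, rk in pairs), []).append(rrow)

    out: list[Row] = []
    for lrow in left:
        matches = index.get(tuple(lrow[lk] for lk, _ in pairs), [])
        for rrow in matches:
            out.append({**lrow, **{c: rrow[c] for c in right_extra}})
        if not matches and how == "left":
            # A left join keeps unmatched left rows and fills right columns with None.
            out.append({**lrow, **{c: None for c in right_extra}})
    return out


users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
orders = [{"uid": 1, "total": 10}]
inner = toy_join(users, orders, on={"user_id": "uid"})
left_joined = toy_join(users, orders, on={"user_id": "uid"}, how="left")
```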

agg(by, aggregations)

Group by the `by` columns and apply aggregation expressions.
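Grouping followed by aggregation yields one output row per distinct key. The sketch below reproduces that shape in pure Python. `toy_agg` is hypothetical and not the Chalk API: here `aggregations` maps an output column name to a plain callable over the rows of a group, standing in for Chalk's aggregation expressions.

```python
from collections import defaultdict
from typing import Any, Callable, Mapping, Sequence

Row = dict[str, Any]


def toy_agg(
    rows: list[Row],
    by: Sequence[str],
    aggregations: Mapping[str, Callable[[list[Row]], Any]],
) -> list[Row]:
    """Group rows by the `by` columns, then compute each aggregation per group."""
    groups: dict[tuple, list[Row]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in by)].append(row)

    out: list[Row] = []
    for key, members in groups.items():
        result: Row = dict(zip(by, key))  # the grouping columns come first
        for name, fn in aggregations.items():
            result[name] = fn(members)
        out.append(result)
    return out


rows = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 7},
    {"user": "a", "amount": 3},
]
totals = toy_agg(
    rows,
    by=["user"],
    aggregations={"total": lambda g: sum(r["amount"] for r in g), "n": len},
)
```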