Chalk SDK Reference

Lightweight DataFrame wrapper around Chalk's execution engine.

The [DataFrame](#DataFrame) class constructs query plans backed by libchalk and can materialize them into Arrow tables. It offers a minimal API similar to other DataFrame libraries while delegating the heavy lifting to the underlying engine.


Logical representation of tabular data.

A [DataFrame](#DataFrame) wraps a `libchalk.chalktable.ChalkTable` plan and a mapping of materialized Arrow tables. Operations construct new plans and return new [DataFrame](#DataFrame) instances, leaving previous ones untouched.

Functions

Create a [DataFrame](#DataFrame) from a plan or materialized Arrow table.

:param root: Either a ChalkTable plan or an in-memory Arrow table.
:param tables: Mapping of additional table names to Arrow data.

Create a [DataFrame](#DataFrame) for a named table.

:param name: Table identifier.
:param schema: Arrow schema describing the table.
:return: DataFrame referencing the named table.

Construct a [DataFrame](#DataFrame) from an in-memory Arrow object.

Scan Parquet files and return a DataFrame.

:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:param num_concurrent_downloads: Number of concurrent downloads.
:param max_num_batches_to_buffer: Maximum number of batches to buffer.
:param target_batch_size_bytes: Target batch size in bytes.
:param observed_at_partition_key: Partition key for observed_at.
:return: DataFrame.

Scan files and return a DataFrame. Currently CSV (with headers) and Parquet are supported.

:param name: Name for the table being scanned.
:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:return: DataFrame.
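The "CSV with headers" requirement can be illustrated with a plain-Python stand-in (a toy only: the real scan runs inside the engine and takes URIs, not strings). The first row supplies the column names, which must line up with the provided schema:

```python
import csv
import io

# Toy CSV content; the header row names the columns.
raw = "id,amount\n1,9.5\n2,3.25\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])  # {'id': '1', 'amount': '9.5'}
```

Note that a header-driven reader yields strings; the schema passed to the real scan is what gives each column a concrete Arrow type.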

Scan Parquet files and return a DataFrame.

:param name: Name for the table being scanned.
:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:return: DataFrame.

Load data from an AWS Glue Iceberg table.

:param glue_table_name: Fully qualified database.table name.
:param schema: Mapping of column names to Arrow types.
:param batch_row_count: Number of rows per batch.
:param aws_catalog_account_id: AWS account hosting the Glue catalog.
:param aws_catalog_region: Region of the Glue catalog.
:param aws_role_arn: IAM role to assume for access.
:param filter_predicate: Optional filter applied during the scan.
:param parquet_scan_range_column: Column used for range-based reads.
:param custom_partitions: Additional partition definitions.
:param partition_column: Column name representing partitions.
:return: DataFrame backed by the Glue table.

Create a [DataFrame](#DataFrame) from a Chalk SQL catalog table.

Create a DataFrame from the result of querying a SQL source.

:param source: SQL source to query.
:param query: SQL query to execute.
:param expected_output_schema: Output schema of the query result. The datasource's driver is expected to convert the native query result to this schema.

Return a string representation of the logical plan.

Return a string representation of the physical plan.

Expose the underlying `ChalkTable` plan.

Return the mapping of materialized tables for this DataFrame.

Add or replace columns based on a mapping of expressions.
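A rough sketch of the semantics in plain Python (a toy model: here columns are lists and "expressions" are callables, whereas the real method evaluates engine expressions lazily and returns a new DataFrame):

```python
# Columns as name -> list-of-values; expressions as name -> callable.
data = {"price": [10.0, 20.0], "qty": [3, 1]}

def with_columns(cols, exprs):
    out = dict(cols)  # the previous frame is left untouched
    for name, fn in exprs.items():
        out[name] = fn(cols)  # add or replace the named column
    return out

new = with_columns(
    data,
    {"total": lambda c: [p * q for p, q in zip(c["price"], c["qty"])]},
)
print(new["total"])        # [30.0, 20.0]
print("total" in data)     # False: original mapping unchanged
```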

Add a monotonically increasing unique identifier column.
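What "monotonically increasing unique identifier" guarantees can be modeled with a single-threaded stand-in (the engine assigns ids during execution, and they need not be consecutive; only uniqueness and increasing order are implied):

```python
rows = ["a", "b", "c"]
# enumerate is a toy id generator: unique and strictly increasing.
ids = [i for i, _ in enumerate(rows)]
print(ids)  # [0, 1, 2]
```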

Filter rows according to `expr`.

Return a subset of rows starting at `start`, with optional `length`.
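Filter and slice compose as you would expect; a plain-Python sketch of the semantics (a toy model, not the engine API):

```python
rows = [{"x": v} for v in range(6)]

# filter(expr): keep rows where the predicate holds.
filtered = [r for r in rows if r["x"] % 2 == 0]

# slice(start=1, length=2): take `length` rows starting at `start`.
window = filtered[1:1 + 2]
print(window)  # [{'x': 2}, {'x': 4}]
```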

Get a column expression from the DataFrame.

:param column: Column name.
:return: Column expression.

Get a column expression from the DataFrame.

:param column: Column name.
:return: Column expression.

Project to the provided column expressions.

Select existing columns by name.

Explode a column in the DataFrame.

:param column: Column name to explode.
:return: DataFrame with the exploded column.
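A toy model of explode semantics: each element of a list-valued column becomes its own row, with the other columns repeated (the real operation runs in the engine over Arrow data):

```python
rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": ["c"]}]

# One output row per element of the exploded column.
exploded = [{"id": r["id"], "tags": t} for r in rows for t in r["tags"]]
print(exploded)
# [{'id': 1, 'tags': 'a'}, {'id': 1, 'tags': 'b'}, {'id': 2, 'tags': 'c'}]
```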

Join this [DataFrame](#DataFrame) with another.

:param other: Right-hand [DataFrame](#DataFrame).
:param on: Column names or mapping of left->right join keys.
:param how: Join type (e.g. "inner" or "left").
:return: Resulting [DataFrame](#DataFrame) after the join.
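The `on` mapping pairs a left-hand column with a right-hand column. A plain-Python sketch of an inner join in the spirit of `join(other, on={"user_id": "id"}, how="inner")` (the column names here are illustrative, not from the SDK):

```python
left = [{"user_id": 1, "spend": 9.0}, {"user_id": 3, "spend": 4.0}]
right = {2: "b", 3: "c"}  # right-hand rows keyed by their "id" column

joined = [
    {**l, "name": right[l["user_id"]]}
    for l in left
    if l["user_id"] in right  # inner: unmatched left rows are dropped
]
print(joined)  # [{'user_id': 3, 'spend': 4.0, 'name': 'c'}]
```

With `how="left"`, the unmatched row (`user_id=1`) would instead be kept with a null `name`.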

Group by the `by` columns and apply aggregation expressions.
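A toy model of group-by plus aggregation: rows are bucketed by the grouping key and each bucket is reduced by an aggregation (here a sum; the real method takes engine aggregation expressions):

```python
from collections import defaultdict

rows = [("a", 1), ("b", 2), ("a", 3)]

sums = defaultdict(int)
for key, value in rows:
    sums[key] += value  # agg: sum(value) per key
print(dict(sums))  # {'a': 4, 'b': 2}
```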

Sort the [DataFrame](#DataFrame) by one or more columns.

Rename columns in the DataFrame.

:param new_names: Dictionary mapping old column names to new column names.
:return: DataFrame with renamed columns.
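A minimal sketch of the rename mapping's behavior (toy model: columns absent from `new_names` keep their names):

```python
columns = ["id", "ts", "amt"]
new_names = {"amt": "amount"}

renamed = [new_names.get(c, c) for c in columns]
print(renamed)  # ['id', 'ts', 'amount']
```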

Execute the plan and yield resulting Arrow RecordBatches.

Class methods for constructing new DataFrame instances from various data sources.

These methods provide multiple ways to create DataFrames:

  • From Arrow tables in memory
  • From Parquet files on disk or cloud storage
  • From AWS Glue Iceberg tables
  • From SQL datasources
  • From Chalk SQL catalog tables

Methods for selecting, transforming, and manipulating columns.

These operations allow you to:

  • Select specific columns by name
  • Add or replace columns with new expressions
  • Rename columns
  • Project columns to new names
  • Get column expressions for use in filters and transformations
  • Add unique identifier columns
  • Explode array columns into multiple rows

Methods for filtering and ordering rows.

These operations allow you to:

  • Filter rows based on conditions
  • Sort rows by one or more columns
  • Select a subset of rows by position

Methods for combining DataFrames and performing group-by operations.

Join operations combine two DataFrames based on matching keys. Aggregation operations group rows and compute summary statistics.

Methods for executing query plans and inspecting DataFrame structure.

These methods allow you to:

  • Execute the DataFrame plan and materialize results
  • View the logical query plan
  • View the physical execution plan
  • Access the underlying ChalkTable plan
  • Get materialized tables
