Lightweight DataFrame wrapper around Chalk's execution engine.
The :class:`DataFrame` class constructs query plans backed by libchalk and
can materialize them into Arrow tables. It offers a minimal API similar to
other DataFrame libraries while delegating heavy lifting to the underlying
engine.
Logical representation of tabular data.
A :class:`DataFrame` wraps a :class:`~libchalk.chalktable.ChalkTable`
plan and a mapping of materialized Arrow tables. Operations construct new
plans and return new [DataFrame](#DataFrame) instances, leaving previous ones
untouched.
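A minimal sketch of these semantics, assuming the Parquet constructor documented below is exposed as a classmethod named `scan_parquet` and using a hypothetical `select` column operation:

```python
import pyarrow as pa

# Hypothetical usage: `scan_parquet` mirrors the constructor documented
# below; `select` stands in for the column operations documented later.
schema = pa.schema([("user_id", pa.int64()), ("amount", pa.float64())])
df = DataFrame.scan_parquet(
    input_uris=["s3://bucket/events/part-0.parquet"],
    schema=schema,
)

# Operations build a new plan and return a new DataFrame;
# `df` still refers to the original, untouched plan.
projected = df.select(["user_id"])  # hypothetical column selection
assert projected is not df
```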
Class methods for constructing new DataFrame instances from various data sources.
Scan Parquet files and return a DataFrame.
:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:param num_concurrent_downloads: Number of concurrent downloads.
:param max_num_batches_to_buffer: Maximum number of batches to buffer.
:param target_batch_size_bytes: Target batch size in bytes.
:param observed_at_partition_key: Partition key for observed_at.
:return: DataFrame
Scan Parquet files and return a DataFrame.
:param name: A name for the table being scanned.
:param input_uris: List of URIs to scan.
:param schema: Schema of the data.
:return: DataFrame
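For illustration, a hedged sketch of the tuned scan; the method name `scan_parquet` and the use of a `pyarrow.Schema` for `schema` are assumptions, while the parameters mirror the list above:

```python
import pyarrow as pa

# Assumptions: the method name `scan_parquet` and the pyarrow.Schema type
# for `schema` are not confirmed by the docstring above.
schema = pa.schema([
    ("event_id", pa.string()),
    ("observed_at", pa.timestamp("us")),
])
df = DataFrame.scan_parquet(
    input_uris=["s3://bucket/events/2024/01/part-0.parquet"],
    schema=schema,
    num_concurrent_downloads=8,                # parallel object fetches
    max_num_batches_to_buffer=4,               # bounds memory while scanning
    target_batch_size_bytes=64 * 1024 * 1024,  # aim for ~64 MiB batches
    observed_at_partition_key="observed_at",
)
```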
Load data from an AWS Glue Iceberg table.
:param glue_table_name: Fully qualified database.table name.
:param schema: Mapping of column names to Arrow types.
:param batch_row_count: Number of rows per batch.
:param aws_catalog_account_id: AWS account hosting the Glue catalog.
:param aws_catalog_region: Region of the Glue catalog.
:param aws_role_arn: IAM role to assume for access.
:param filter_predicate: Optional filter applied during scan.
:param parquet_scan_range_column: Column used for range-based reads.
:param custom_partitions: Additional partition definitions.
:param partition_column: Column name representing partitions.
:return: DataFrame backed by the Glue table.
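A sketch of the Glue constructor; the method name `scan_glue_iceberg` is an assumption, while the parameters follow the docstring above:

```python
import pyarrow as pa

# Hypothetical method name; parameters mirror the documented signature.
df = DataFrame.scan_glue_iceberg(
    glue_table_name="analytics.transactions",  # database.table
    schema={"txn_id": pa.string(), "amount": pa.float64()},
    batch_row_count=100_000,
    aws_catalog_account_id="123456789012",
    aws_catalog_region="us-east-1",
    aws_role_arn="arn:aws:iam::123456789012:role/chalk-glue-reader",
    filter_predicate=None,                     # optional scan-time filter
)
```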
Methods for selecting, transforming, and manipulating columns.
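A hedged sketch of typical column operations; the method names `select` and `with_column`, and the expression form, are assumptions modeled on common DataFrame APIs:

```python
# Hypothetical API: `select` projects columns, `with_column` derives one.
projected = df.select(["user_id", "amount"])
enriched = projected.with_column("amount_cents", projected["amount"] * 100)
```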
Methods for filtering and ordering rows.
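A sketch under the same caveat; `filter` and `sort` are assumed names, and the predicate form is illustrative only:

```python
# Hypothetical API: keep matching rows, then order the result.
recent = df.filter(df["amount"] > 100)  # predicate form is an assumption
ordered = recent.sort("observed_at")    # assumed to sort ascending
```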
Methods for combining DataFrames and performing group-by operations.
Join operations combine two DataFrames based on matching keys. Aggregation operations group rows and compute summary statistics.
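A hedged sketch of a join followed by an aggregation; `join`, `group_by`, and `agg` are assumed method names:

```python
# Hypothetical API: join on a shared key, then compute per-user totals.
# `orders` and `users` are assumed to be existing DataFrames.
joined = orders.join(users, on="user_id")
totals = joined.group_by("user_id").agg({"amount": "sum"})
```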
Methods for executing query plans and inspecting DataFrame structure.
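A sketch of execution and inspection; the `to_arrow` method (materializing the plan into a `pyarrow.Table`) and the `schema` property are assumptions:

```python
# Hypothetical API: materialize the lazy plan and inspect its structure.
print(df.schema)       # assumed: inspect the plan without executing it
table = df.to_arrow()  # assumed: run the plan, yielding a pyarrow.Table
print(table.num_rows)
```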