Lightweight DataFrame wrapper around Chalk's execution engine.
The DataFrame class constructs query plans backed by libchalk and
can materialize them into Arrow tables. It offers a minimal API similar to
other DataFrame libraries while delegating heavy lifting to the underlying
engine.
A DataFrame wraps a plan and a mapping of materialized Arrow tables.
Operations construct new plans and return new DataFrame instances, leaving
previous ones untouched.
Logical representation of tabular data for query operations.
DataFrame provides a lazy evaluation model where operations build up a query
plan that executes only when materialized. Most users should use the class
methods like from_dict, from_arrow, or scan to create
DataFrames rather than calling the constructor directly.
from chalkdf import DataFrame
from chalk.features import _
# Create from a dictionary
df = DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
# Apply operations
filtered = df.filter(_.x > 1)
result = filtered.run()
Create a DataFrame from a dictionary, Arrow table, or query plan.
For most use cases, prefer using class methods like from_dict,
from_arrow, or scan instead of calling this constructor directly.
from chalkdf import DataFrame
# Simple dictionary input
df = DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
# Or use the explicit class method (recommended)
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
Return the number of rows if this DataFrame has already been materialized.
Raising TypeError for frames that have not been materialized avoids implicitly executing the plan, and matches Python's behavior for objects that do not define a length.
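The design choice can be sketched in plain Python (a hypothetical LazyTable class, not chalkdf's code): __len__ answers only when data is already in memory, and otherwise raises TypeError rather than silently triggering execution.

```python
# Sketch of a __len__ that refuses to execute a plan implicitly
# (illustrative only; not chalkdf's implementation).
class LazyTable:
    def __init__(self):
        self._rows = None                # None until the plan is run

    def run(self):
        self._rows = [1, 2, 3]           # stand-in for real execution
        return self._rows

    def __len__(self):
        if self._rows is None:
            raise TypeError("not materialized; call .run() first")
        return len(self._rows)

t = LazyTable()
try:
    len(t)                   # raises: nothing has been materialized yet
except TypeError as e:
    print(e)
t.run()
print(len(t))                # 3
```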
Scan files and return a DataFrame.
Currently supports CSV (with headers) and Parquet file formats.
from chalkdf import DataFrame
# Scan Parquet files
df = DataFrame.scan("sales_data", ["data/sales_2024.parquet"])
# Scan CSV with explicit schema
import pyarrow as pa
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
df = DataFrame.scan("users", ["data/users.csv"], schema=schema)
import pyarrow as pa
from chalkdf import DataFrame
from chalk.sql import PostgreSQLSource
source = PostgreSQLSource(...)
schema = pa.schema([("user_id", pa.int64()), ("name", pa.string())])
df = DataFrame.from_datasource(source, "SELECT * FROM users", schema)
Add or replace columns.
Accepts multiple forms: a dict mapping output column names to expressions, or one or more expressions named with .alias(<name>).
from chalkdf import DataFrame
from chalk.features import _
df = DataFrame.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
# Add a new column using a dict with _ syntax
df2 = df.with_columns({"z": _.x + _.y})
# Add a new column using alias
df3 = df.with_columns((_.x + _.y).alias("z"))
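Both forms can be thought of as normalizing to a single name-to-expression mapping. The sketch below (hypothetical Expr class and normalize_columns helper, not chalkdf's internals) shows one way that normalization could work.

```python
# Sketch of normalizing with_columns-style arguments into one
# name -> expression mapping (hypothetical helper, not chalkdf's code).
class Expr:
    def __init__(self, description, name=None):
        self.description = description   # e.g. "x + y"
        self.name = name                 # output column name, if any

    def alias(self, name):
        # Attach an output column name to the expression.
        return Expr(self.description, name)

def normalize_columns(*args):
    columns = {}
    for arg in args:
        if isinstance(arg, dict):
            columns.update(arg)                  # {"z": expr} form
        elif isinstance(arg, Expr) and arg.name is not None:
            columns[arg.name] = arg              # expr.alias("z") form
        else:
            raise TypeError("expected a dict or an aliased expression")
    return columns

e = Expr("x + y")
print(list(normalize_columns({"z": e})))         # ['z']
print(list(normalize_columns(e.alias("z"))))     # ['z']
```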
Perform an as-of join with another DataFrame.
An as-of join is similar to a left join, but instead of matching on equality, it matches on the nearest key from the right DataFrame. This is commonly used for time-series data where you want to join with the most recent observation.
Important: Both DataFrames must be sorted by the on column before calling
this method. Use .order_by(on) to sort if needed.
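The matching rule can be made concrete with a small pure-Python sketch (illustrative only, not the engine's algorithm): for each left key, take the right row with the greatest key that is less than or equal to it, assuming both sides are sorted by the join key.

```python
import bisect

# Backward as-of join on sorted keys (illustrative only): each left key
# matches the most recent right key that is <= it, or None if none exists.
def asof_join(left_keys, right_keys, right_values):
    out = []
    for k in left_keys:
        # bisect_right returns how many right keys are <= k
        i = bisect.bisect_right(right_keys, k)
        out.append(right_values[i - 1] if i > 0 else None)
    return out

trades = [3, 7, 10]                  # e.g. trade timestamps (sorted)
quotes = [1, 5, 9]                   # quote timestamps (sorted)
prices = [100.0, 101.5, 99.8]        # quoted price at each timestamp
print(asof_join(trades, quotes, prices))  # [100.0, 101.5, 99.8]
```

The sorted-input requirement exists because this nearest-key lookup is only well defined (and efficient) over ordered keys.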
Class methods for constructing new DataFrame instances from various data sources.
These methods provide multiple ways to create DataFrames: from in-memory Python dictionaries (from_dict), existing Arrow tables (from_arrow), files on disk (scan), and SQL data sources (from_datasource).
Methods for selecting, transforming, and manipulating columns.
These operations allow you to project a subset of columns and to add or replace columns with computed expressions (for example, with_columns).
Methods for filtering and ordering rows.
These operations allow you to keep only the rows matching a predicate (filter) and to sort rows by one or more columns (order_by).
Methods for combining DataFrames and performing group-by operations.
Join operations combine two DataFrames based on matching keys. Aggregation operations group rows and compute summary statistics.
Methods for executing query plans and inspecting DataFrame structure.
These methods allow you to execute the query plan and materialize the result as an Arrow table (run), and to inspect a DataFrame's structure, such as its schema, without executing it.