Install and run your first chalkdf commands
This guide will walk you through installing chalkdf and performing some basic operations.
If you are new to chalkdf or to DataFrame libraries, this is a great place to start. If you
are already familiar with DataFrame libraries, you can skip ahead to the full
Reference Library or the installation guide.
With Python 3.10–3.13 on Linux or macOS 15+, you can install chalkdf via pip:

uv pip install "chalkdf[chalkpy]"

chalkdf supports reading data from several formats and sources, including JSON, CSV, and Parquet files,
Arrow objects and tables, AWS Glue Iceberg tables, and SQL sources defined via chalkpy.
Below we can initialize a chalkdf.DataFrame from a Python dictionary.
from chalkdf import DataFrame
df = DataFrame(
{
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"city": ["New York", "Los Angeles", "Chicago"],
}
)
df.run()
DataFrame(materialized 3 rows x 3 columns)
┌─────────┬───────┬─────────────┐
│ name ┆ age ┆ city │
│ ─────── ┆ ───── ┆ ─────────── │
│ string ┆ int64 ┆ string │
╞═════════╪═══════╪═════════════╡
│ Alice ┆ 25 ┆ New York │
│ Bob ┆ 30 ┆ Los Angeles │
│ Charlie ┆ 35 ┆ Chicago │
└─────────┴───────┴─────────────┘
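If you want to follow along with the file-scanning example below, you can first create a matching people.csv using only the Python standard library. The rows here simply reproduce the sample data shown in the scan output.

```python
import csv

# Sample rows matching the people.csv example below.
rows = [
    ("Alice Chen", "1990-03-15", "Software Engineer"),
    ("Bob Smith", "1985-07-22", "Teacher"),
    ("Carol Johnson", "1992-11-08", "Data Analyst"),
]

with open("people.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "birthday", "occupation"])  # header row
    writer.writerows(rows)
```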
You can also scan files directly into a DataFrame. Below, we scan a CSV file:
from chalkdf import DataFrame
import pyarrow as pa
df = DataFrame.scan(
"people",
['people.csv'],
schema=pa.schema([
('name', pa.string()),
('birthday', pa.date32()),
('occupation', pa.string())
])
)
df.run()
DataFrame(materialized 3 rows x 3 columns)
┌───────────────┬─────────────┬───────────────────┐
│ name ┆ birthday ┆ occupation │
│ ───────────── ┆ ─────────── ┆ ───────────────── │
│ string ┆ date32[day] ┆ string │
╞═══════════════╪═════════════╪═══════════════════╡
│ Alice Chen ┆ 1990-03-15 ┆ Software Engineer │
│ Bob Smith ┆ 1985-07-22 ┆ Teacher │
│ Carol Johnson ┆ 1992-11-08 ┆ Data Analyst │
└───────────────┴─────────────┴───────────────────┘
DataFrame expressions allow you to transform and explore your data through a variety of operations.
For example, given a DataFrame of values with IDs, you could group the values by ID and compute the
aggregate sum per group. The underscore notation _ references the current DataFrame, letting you
access columns without any additional wrappers. You can import _ from the chalk module.
>>> from chalk import _
>>> df.run()
DataFrame(materialized 5 rows x 2 columns)
┌───────┬───────┐
│ id ┆ value │
│ ───── ┆ ───── │
│ int64 ┆ int64 │
╞═══════╪═══════╡
│ 1 ┆ 10 │
│ 1 ┆ 20 │
│ 2 ┆ 1 │
│ 2 ┆ 2 │
│ 2 ┆ 3 │
└───────┴───────┘
>>> grouped = df.agg(["id"], _.value.sum().alias("value_sum_by_id"))
>>> grouped.run()
DataFrame(materialized 2 rows x 2 columns)
┌───────┬─────────────────┐
│ id ┆ value_sum_by_id │
│ ───── ┆ ─────────────── │
│ int64 ┆ int64 │
╞═══════╪═════════════════╡
│ 1 ┆ 30 │
│ 2 ┆ 6 │
└───────┴─────────────────┘

Below are some common operations you can use to construct DataFrame expressions.
DataFrame.select allows you to extract specific columns from a DataFrame.
>>> interactions = df.select("user_id", "item_id", "interaction_type", "timestamp")
>>> interactions.run()
DataFrame(materialized 5 rows x 4 columns)
┌──────────┬───────────┬──────────────────┬─────────────────────┐
│ user_id ┆ item_id ┆ interaction_type ┆ timestamp │
│ ──────── ┆ ───────── ┆ ──────────────── ┆ ─────────────────── │
│ string ┆ string ┆ string ┆ timestamp[us] │
╞══════════╪═══════════╪══════════════════╪═════════════════════╡
│ user_001 ┆ item_4521 ┆ click ┆ 2024-11-08 14:23:15 │
│ user_002 ┆ item_8832 ┆ purchase ┆ 2024-11-08 15:10:42 │
│ user_003 ┆ item_1203 ┆ view ┆ 2024-11-08 16:05:30 │
│ user_004 ┆ item_4521 ┆ add_to_cart ┆ 2024-11-08 17:45:12 │
│ user_005 ┆ item_9944 ┆ click ┆ 2024-11-08 18:20:05 │
└──────────┴───────────┴──────────────────┴─────────────────────┘

To add new columns or modify existing columns in your DataFrame while retaining all original
columns, you can use DataFrame.with_columns.
>>> from chalk import _
>>> processed_interactions = df.with_columns({"user_id": _.user_id, "long_session": _.session_duration_sec > 180})
>>> processed_interactions.select("user_id", "session_duration_sec", "long_session").run()
DataFrame(materialized 5 rows x 3 columns)
┌──────────┬──────────────────────┬──────────────┐
│ user_id ┆ session_duration_sec ┆ long_session │
│ ──────── ┆ ──────────────────── ┆ ──────────── │
│ string ┆ int64 ┆ bool │
╞══════════╪══════════════════════╪══════════════╡
│ user_001 ┆ 145 ┆ False │
│ user_002 ┆ 320 ┆ True │
│ user_003 ┆ 78 ┆ False │
│ user_004 ┆ 210 ┆ True │
│ user_005 ┆ 167 ┆ False │
└──────────┴──────────────────────┴──────────────┘

To define more complex transformations without retaining all original columns, you can use DataFrame.project to take
in a DataFrame as input and project to new columns based on provided expressions. In these expressions, you can use the
underscore notation _ to reference columns within your source DataFrame, and utilize the chalk.functions
library (typically imported as F) for a variety of operations.
from chalk import functions as F, _
from chalkdf import DataFrame
import pyarrow as pa
tbl = pa.table(
{
"txns_last_hour": [[1, 2, 3, 4, 5], [100], [200, 201]],
"max_txns_allowed": [3, 5, 4],
}
)
df = DataFrame.from_arrow(tbl)
out = df.project(
{
"velocity_score": _.txns_last_hour
.cardinality()
.cast(float)
.least(_.max_txns_allowed + 0.5)
.ceil()
.cast(int),
"velocity_score_2": F.cast(
F.ceil(
F.least(
F.cast(F.cardinality(_.txns_last_hour), float),
_.max_txns_allowed + 0.5
)
),
int,
),
}
)
out.run()
DataFrame(materialized 3 rows x 2 columns)
┌────────────────┬──────────────────┐
│ velocity_score ┆ velocity_score_2 │
│ ────────────── ┆ ──────────────── │
│ int64 ┆ int64 │
╞════════════════╪══════════════════╡
│ 4 ┆ 4 │
│ 1 ┆ 1 │
│ 2 ┆ 2 │
└────────────────┴──────────────────┘
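To make the expression above concrete, here is the same per-row arithmetic written as plain Python. This is a sketch of the semantics only, not chalkdf code: cardinality is the list length, least is min, and the final cast truncates to an integer.

```python
import math

# Plain-Python equivalent of the velocity_score expression above.
txns_last_hour = [[1, 2, 3, 4, 5], [100], [200, 201]]
max_txns_allowed = [3, 5, 4]

velocity_scores = [
    # cardinality -> cast(float) -> least(allowed + 0.5) -> ceil -> cast(int)
    int(math.ceil(min(float(len(txns)), allowed + 0.5)))
    for txns, allowed in zip(txns_last_hour, max_txns_allowed)
]
print(velocity_scores)  # [4, 1, 2]
```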
You can filter rows in a DataFrame using DataFrame.filter, which takes in a boolean expression.
>>> from chalk import _
>>> df.run()
DataFrame(materialized 5 rows x 7 columns)
┌──────────┬───────────┬──────────────────┬─────────────────────┬────────┬─────────────┬──────────────────────┐
│ user_id ┆ item_id ┆ interaction_type ┆ timestamp ┆ score ┆ category ┆ session_duration_sec │
│ ──────── ┆ ───────── ┆ ──────────────── ┆ ─────────────────── ┆ ────── ┆ ─────────── ┆ ──────────────────── │
│ string ┆ string ┆ string ┆ timestamp[us] ┆ double ┆ string ┆ int64 │
╞══════════╪═══════════╪══════════════════╪═════════════════════╪════════╪═════════════╪══════════════════════╡
│ user_001 ┆ item_4521 ┆ click ┆ 2024-11-08 14:23:15 ┆ 0.85 ┆ electronics ┆ 145 │
│ user_002 ┆ item_8832 ┆ purchase ┆ 2024-11-08 15:10:42 ┆ 0.92 ┆ fashion ┆ 320 │
│ user_003 ┆ item_1203 ┆ view ┆ 2024-11-08 16:05:30 ┆ 0.67 ┆ home ┆ 78 │
│ user_004 ┆ item_4521 ┆ add_to_cart ┆ 2024-11-08 17:45:12 ┆ 0.78 ┆ electronics ┆ 210 │
│ user_005 ┆ item_9944 ┆ click ┆ 2024-11-08 18:20:05 ┆ 0.81 ┆ sports ┆ 167 │
└──────────┴───────────┴──────────────────┴─────────────────────┴────────┴─────────────┴──────────────────────┘
>>> df.filter(_.score > 0.8).run()
DataFrame(materialized 3 rows x 7 columns)
┌──────────┬───────────┬──────────────────┬─────────────────────┬────────┬─────────────┬──────────────────────┐
│ user_id ┆ item_id ┆ interaction_type ┆ timestamp ┆ score ┆ category ┆ session_duration_sec │
│ ──────── ┆ ───────── ┆ ──────────────── ┆ ─────────────────── ┆ ────── ┆ ─────────── ┆ ──────────────────── │
│ string ┆ string ┆ string ┆ timestamp[us] ┆ double ┆ string ┆ int64 │
╞══════════╪═══════════╪══════════════════╪═════════════════════╪════════╪═════════════╪══════════════════════╡
│ user_001 ┆ item_4521 ┆ click ┆ 2024-11-08 14:23:15 ┆ 0.85 ┆ electronics ┆ 145 │
│ user_002 ┆ item_8832 ┆ purchase ┆ 2024-11-08 15:10:42 ┆ 0.92 ┆ fashion ┆ 320 │
│ user_005 ┆ item_9944 ┆ click ┆ 2024-11-08 18:20:05 ┆ 0.81 ┆ sports ┆ 167 │
└──────────┴───────────┴──────────────────┴─────────────────────┴────────┴─────────────┴──────────────────────┘

To compute aggregations over groups of data, you can use DataFrame.agg.
>>> processed_interactions.agg(
... ["long_session"],
... processed_interactions.column("score").mean().alias("avg_score")
... ).run()
DataFrame(materialized 2 rows x 2 columns)
┌──────────────┬───────────┐
│ long_session ┆ avg_score │
│ ──────────── ┆ ───────── │
│ bool ┆ double │
╞══════════════╪═══════════╡
│ False ┆ 0.776667 │
│ True ┆ 0.85 │
└──────────────┴───────────┘

You can combine data from multiple DataFrames using DataFrame.join. Below is an example of joining two
DataFrames using an inner join on the user_id column. You can also specify right and left joins.
>>> txns_df.join(
... users_df,
... on=["user_id"],
... how="inner"
... ).select(
... "transaction_id",
... "user_id",
... "name",
... "amount",
... "tier",
... "status"
... ).run()
DataFrame(materialized 5 rows x 6 columns)
┌────────────────┬──────────┬─────────┬────────┬─────────┬───────────┐
│ transaction_id ┆ user_id ┆ name ┆ amount ┆ tier ┆ status │
│ ────────────── ┆ ──────── ┆ ─────── ┆ ────── ┆ ─────── ┆ ───────── │
│ string ┆ string ┆ string ┆ double ┆ string ┆ string │
╞════════════════╪══════════╪═════════╪════════╪═════════╪═══════════╡
│ txn_101 ┆ user_001 ┆ Alice ┆ 49.99 ┆ premium ┆ completed │
│ txn_102 ┆ user_002 ┆ Bob ┆ 19.99 ┆ basic ┆ completed │
│ txn_103 ┆ user_001 ┆ Alice ┆ 89.5 ┆ premium ┆ pending │
│ txn_104 ┆ user_003 ┆ Charlie ┆ 120 ┆ premium ┆ completed │
│ txn_105 ┆ user_001 ┆ Alice ┆ 15.75 ┆ premium ┆ completed │
└────────────────┴──────────┴─────────┴────────┴─────────┴───────────┘
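An inner join keeps only rows whose key appears in both tables. As a plain-Python sketch of what the join above computes (the dicts here are reconstructions of the sample data, not chalkdf code):

```python
# Users table keyed by user_id.
users = {
    "user_001": {"name": "Alice", "tier": "premium"},
    "user_002": {"name": "Bob", "tier": "basic"},
    "user_003": {"name": "Charlie", "tier": "premium"},
}

# Transactions table.
txns = [
    {"transaction_id": "txn_101", "user_id": "user_001", "amount": 49.99, "status": "completed"},
    {"transaction_id": "txn_102", "user_id": "user_002", "amount": 19.99, "status": "completed"},
    {"transaction_id": "txn_103", "user_id": "user_001", "amount": 89.5, "status": "pending"},
    {"transaction_id": "txn_104", "user_id": "user_003", "amount": 120.0, "status": "completed"},
    {"transaction_id": "txn_105", "user_id": "user_001", "amount": 15.75, "status": "completed"},
]

# Inner join: keep each transaction with a matching user, merging the columns.
joined = [
    {**t, **users[t["user_id"]]}
    for t in txns
    if t["user_id"] in users  # rows without a match on either side are dropped
]
print(len(joined))  # 5
```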