This guide will walk you through installing chalkdf and performing some basic operations. If you are new to chalkdf or DataFrame libraries, this is a great place to start. If you are already familiar with DataFrame libraries you can skip ahead to the full Reference Library or the installation guide

Installation

Given Python 3.10-3.13 and Linux/macOS 15+, you can install chalkdf via pip:

uv pip install "chalkdf[chalkpy]"

Reading Data

chalkdf supports reading data from several formats and sources, including JSON, CSV, and Parquet files, Arrow objects and tables, as well as AWS Glue Iceberg tables, and SQL sources defined via chalkpy.

Below we can initialize a chalkdf.DataFrame from a Python dictionary.

from chalkdf import DataFrame

df = DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "city": ["New York", "Los Angeles", "Chicago"],
    }
)

df.run()
DataFrame(materialized 3 rows x 3 columns)
┌─────────┬───────┬─────────────┐
│ name    ┆ age   ┆ city        │
│ ─────── ┆ ───── ┆ ─────────── │
│ string  ┆ int64 ┆ string      │
╞═════════╪═══════╪═════════════╡
│ Alice   ┆ 25    ┆ New York    │
│ Bob     ┆ 30    ┆ Los Angeles │
│ Charlie ┆ 35    ┆ Chicago     │
└─────────┴───────┴─────────────┘

You can also scan files directly into a DataFrame. Below, we scan a CSV file:

from chalkdf import DataFrame
import pyarrow as pa

df = DataFrame.scan(
    "people",
    ['people.csv'],
    schema=pa.schema([
        ('name', pa.string()),
        ('birthday', pa.date32()),
        ('occupation', pa.string())
    ])
)
df.run()
DataFrame(materialized 3 rows x 3 columns)
┌───────────────┬─────────────┬───────────────────┐
│ name          ┆ birthday    ┆ occupation        │
│ ───────────── ┆ ─────────── ┆ ───────────────── │
│ string        ┆ date32[day] ┆ string            │
╞═══════════════╪═════════════╪═══════════════════╡
│ Alice Chen    ┆ 1990-03-15  ┆ Software Engineer │
│ Bob Smith     ┆ 1985-07-22  ┆ Teacher           │
│ Carol Johnson ┆ 1992-11-08  ┆ Data Analyst      │
└───────────────┴─────────────┴───────────────────┘

DataFrame Expressions

DataFrame expressions allow you to transform and explore your data through a variety of operations. For example, given a DataFrame of values with id’s, you could group the values by id and compute the aggregate sum. We use the underscore notation _ to reference the current DataFrame, to easily access columns without any additional wrappers. You can import _ from the chalk module.

>>> from chalk import _
>>> df.run()

DataFrame(materialized 5 rows x 2 columns)
┌───────┬───────┐
│ id    ┆ value │
│ ───── ┆ ───── │
│ int64 ┆ int64 │
╞═══════╪═══════╡
│ 110    │
│ 120    │
│ 21     │
│ 22     │
│ 23     │
└───────┴───────┘
>>> from chalk.features import _
>>> grouped = df.agg(["id"], _.value.sum().alias("value_sum_by_id"))
>>> grouped.run()

DataFrame(materialized 2 rows x 2 columns)
┌───────┬─────────────────┐
│ id    ┆ value_sum_by_id │
│ ───── ┆ ─────────────── │
│ int64 ┆ int64           │
╞═══════╪═════════════════╡
│ 130              │
│ 26               │
└───────┴─────────────────┘

Below are some common operations you can use to construct DataFrame expressions.

select

DataFrame.select allows you to extract specific columns from a DataFrame.

>>> interactions = df.select("user_id", "item_id", "interaction_type", "timestamp")
>>> interactions.run()

DataFrame(materialized 5 rows x 4 columns)
┌──────────┬───────────┬──────────────────┬─────────────────────┐
│ user_id  ┆ item_id   ┆ interaction_type ┆ timestamp           │
│ ──────── ┆ ───────── ┆ ──────────────── ┆ ─────────────────── │
│ string   ┆ string    ┆ string           ┆ timestamp[us]       │
╞══════════╪═══════════╪══════════════════╪═════════════════════╡
│ user_001 ┆ item_4521 ┆ click            ┆ 2024-11-08 14:23:15 │
│ user_002 ┆ item_8832 ┆ purchase         ┆ 2024-11-08 15:10:42 │
│ user_003 ┆ item_1203 ┆ view             ┆ 2024-11-08 16:05:30 │
│ user_004 ┆ item_4521 ┆ add_to_cart      ┆ 2024-11-08 17:45:12 │
│ user_005 ┆ item_9944 ┆ click            ┆ 2024-11-08 18:20:05 │
└──────────┴───────────┴──────────────────┴─────────────────────┘

with_columns

To add new columns or modify existing columns in your DataFrame while retaining all original columns, you can use DataFrame.with_columns.

>>> from chalk import _
>>> processed_interactions = df.with_columns({"user_id": _.user_id, "long_session": _.session_duration_sec > 180})
>>> processed_interactions.select("user_id", "session_duration_sec", "long_session").run()

DataFrame(materialized 5 rows x 3 columns)
┌──────────┬──────────────────────┬──────────────┐
│ user_id  ┆ session_duration_sec ┆ long_session │
│ ──────── ┆ ──────────────────── ┆ ──────────── │
│ string   ┆ int64                ┆ bool         │
╞══════════╪══════════════════════╪══════════════╡
│ user_001 ┆ 145False        │
│ user_002 ┆ 320True         │
│ user_003 ┆ 78False        │
│ user_004 ┆ 210True         │
│ user_005 ┆ 167False        │
└──────────┴──────────────────────┴──────────────┘

project

To define more complex transformations without retaining all original columns, you can use DataFrame.project to take in a DataFrame as input and project to new columns based on provided expressions. In these expressions, you can use the underscore notation _ to reference columns within your source DataFrame, and utilize the chalk.functions library (typically imported as F) for a variety of operations.

from chalk import functions as F, _
from chalkdf import DataFrame
import pyarrow as pa

tbl = pa.table(
  {
    "txns_last_hour": [[1, 2, 3, 4, 5], [100], [200, 201]],
    "max_txns_allowed": [3, 5, 4],
  }
)

df = DataFrame.from_arrow(tbl)

out = df.project(
  {
    "velocity_score": _.txns_last_hour
      .cardinality()
      .cast(float)
      .least(_.max_txns_allowed + 0.5)
      .ceil()
      .cast(int),
    "velocity_score_2": F.cast(
      F.ceil(
        F.least(
          F.cast(F.cardinality(_.txns_last_hour), float),
          _.max_txns_allowed + 0.5
        )
      ),
      int,
    ),
  }
)
out.run()
DataFrame(materialized 3 rows x 2 columns)
┌────────────────┬──────────────────┐
│ velocity_score ┆ velocity_score_2 │
│ ────────────── ┆ ──────────────── │
│ int64          ┆ int64            │
╞════════════════╪══════════════════╡
│ 4              ┆ 4                │
│ 1              ┆ 1                │
│ 2              ┆ 2                │
└────────────────┴──────────────────┘

filter

You can filter rows in a DataFrame using DataFrame.filter, which takes in a boolean expression.

>>> from chalk import _
>>> df.run()

DataFrame(materialized 5 rows x 7 columns)
┌──────────┬───────────┬──────────────────┬─────────────────────┬────────┬─────────────┬──────────────────────┐
│ user_id  ┆ item_id   ┆ interaction_type ┆ timestamp           ┆ score  ┆ category    ┆ session_duration_sec │
│ ──────── ┆ ───────── ┆ ──────────────── ┆ ─────────────────── ┆ ────── ┆ ─────────── ┆ ──────────────────── │
│ string   ┆ string    ┆ string           ┆ timestamp[us]       ┆ double ┆ string      ┆ int64                │
╞══════════╪═══════════╪══════════════════╪═════════════════════╪════════╪═════════════╪══════════════════════╡
│ user_001 ┆ item_4521 ┆ click            ┆ 2024-11-08 14:23:150.85   ┆ electronics ┆ 145                  │
│ user_002 ┆ item_8832 ┆ purchase         ┆ 2024-11-08 15:10:420.92   ┆ fashion     ┆ 320                  │
│ user_003 ┆ item_1203 ┆ view             ┆ 2024-11-08 16:05:300.67   ┆ home        ┆ 78                   │
│ user_004 ┆ item_4521 ┆ add_to_cart      ┆ 2024-11-08 17:45:120.78   ┆ electronics ┆ 210                  │
│ user_005 ┆ item_9944 ┆ click            ┆ 2024-11-08 18:20:050.81   ┆ sports      ┆ 167                  │
└──────────┴───────────┴──────────────────┴─────────────────────┴────────┴─────────────┴──────────────────────┘
>>> df.filter(_.score > 0.8).run()

DataFrame(materialized 3 rows x 7 columns)
┌──────────┬───────────┬──────────────────┬─────────────────────┬────────┬─────────────┬──────────────────────┐
│ user_id  ┆ item_id   ┆ interaction_type ┆ timestamp           ┆ score  ┆ category    ┆ session_duration_sec │
│ ──────── ┆ ───────── ┆ ──────────────── ┆ ─────────────────── ┆ ────── ┆ ─────────── ┆ ──────────────────── │
│ string   ┆ string    ┆ string           ┆ timestamp[us]       ┆ double ┆ string      ┆ int64                │
╞══════════╪═══════════╪══════════════════╪═════════════════════╪════════╪═════════════╪══════════════════════╡
│ user_001 ┆ item_4521 ┆ click            ┆ 2024-11-08 14:23:150.85   ┆ electronics ┆ 145                  │
│ user_002 ┆ item_8832 ┆ purchase         ┆ 2024-11-08 15:10:420.92   ┆ fashion     ┆ 320                  │
│ user_005 ┆ item_9944 ┆ click            ┆ 2024-11-08 18:20:050.81   ┆ sports      ┆ 167                  │
└──────────┴───────────┴──────────────────┴─────────────────────┴────────┴─────────────┴──────────────────────┘

agg

To compute aggregations over groups of data, you can use DataFrame.agg.

>>> processed_interactions.agg(
...    ["long_session"],
...    processed_interactions.column("score").mean().alias("avg_score")
... ).run()

DataFrame(materialized 2 rows x 2 columns)
┌──────────────┬───────────┐
│ long_session ┆ avg_score │
│ ──────────── ┆ ───────── │
│ bool         ┆ double    │
╞══════════════╪═══════════╡
│ False0.776667  │
│ True0.85      │
└──────────────┴───────────┘

join

You can combine data from multiple DataFrames using DataFrame.join. Below is an example of joining two DataFrames using an inner join on the user_id column. You can also specify right and left joins.

>>> txns_df.join(
...    users_df,
...    on=["user_id"],
...    how="inner"
... ).select(
...    "transaction_id",
...    "user_id",
...    "name",
...    "amount",
...    "tier",
...    "status"
... ).run()

DataFrame(materialized 5 rows x 6 columns)
┌────────────────┬──────────┬─────────┬────────┬─────────┬───────────┐
│ transaction_id ┆ user_id  ┆ name    ┆ amount ┆ tier    ┆ status    │
│ ────────────── ┆ ──────── ┆ ─────── ┆ ────── ┆ ─────── ┆ ───────── │
│ string         ┆ string   ┆ string  ┆ double ┆ string  ┆ string    │
╞════════════════╪══════════╪═════════╪════════╪═════════╪═══════════╡
│ txn_101        ┆ user_001 ┆ Alice   ┆ 49.99  ┆ premium ┆ completed │
│ txn_102        ┆ user_002 ┆ Bob     ┆ 19.99  ┆ basic   ┆ completed │
│ txn_103        ┆ user_001 ┆ Alice   ┆ 89.5   ┆ premium ┆ pending   │
│ txn_104        ┆ user_003 ┆ Charlie ┆ 120    ┆ premium ┆ completed │
│ txn_105        ┆ user_001 ┆ Alice   ┆ 15.75  ┆ premium ┆ completed │
└────────────────┴──────────┴─────────┴────────┴─────────┴───────────┘