Getting Started

This guide will walk you through installing chalkdf and performing some basic operations. If you are new to chalkdf or to DataFrame libraries, this is a great place to start. If you are already familiar with DataFrame libraries, you can skip ahead to the full Reference Library or to the installation guide.

Installation

chalkdf supports Python 3.10-3.13 on Linux and macOS 15+. You can install it via pip:

pip install "chalkdf[chalkpy]"
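
To verify the installation, you can construct and run a minimal DataFrame (constructing DataFrames and calling run() are covered in the next section):

from chalkdf import DataFrame

# Build a one-column DataFrame and materialize it; if the install
# succeeded, this prints a small table summary.
DataFrame({"ok": [1]}).run()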

Reading Data

chalkdf supports reading data from several formats and sources, including JSON, CSV, and Parquet files; Arrow objects and tables; AWS Glue Iceberg tables; and SQL sources defined via chalkpy.

Below, we initialize a chalkdf.DataFrame from a Python dictionary.

from chalkdf import DataFrame

df = DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "city": ["New York", "Los Angeles", "Chicago"],
    }
)

df.run()
DataFrame(materialized 3 rows x 3 columns)
Schema: name (string), age (int64), city (string)
Showing all rows:
name (string) | age (int64) | city (string)
--------------+-------------+--------------
Alice         | 25          | New York
Bob           | 30          | Los Angeles
Charlie       | 35          | Chicago

You can also scan files directly into a DataFrame. Below, we scan a CSV file:

from chalkdf import DataFrame
import pyarrow as pa

df = DataFrame.scan(
    "people",
    ["people.csv"],
    schema=pa.schema([
        ("name", pa.string()),
        ("birthday", pa.date32()),
        ("occupation", pa.string())
    ])
)
df.run()
DataFrame(materialized 3 rows x 3 columns)
Schema: name (string), birthday (date32[day]), occupation (string)
Showing all rows:
name (string) | birthday (date32[day]) | occupation (string)
--------------+------------------------+--------------------
Alice Chen    | 1990-03-15             | Software Engineer
Bob Smith     | 1985-07-22             | Teacher
Carol Johnson | 1992-11-08             | Data Analyst
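
chalkdf can also ingest Arrow tables directly via DataFrame.from_arrow (used again in the project section below). Combined with pyarrow, this gives a straightforward path for Parquet data; the people.parquet path here is a placeholder:

import pyarrow.parquet as pq
from chalkdf import DataFrame

# Read a Parquet file into an Arrow table with pyarrow, then wrap
# it in a chalkdf DataFrame. "people.parquet" is a placeholder path.
tbl = pq.read_table("people.parquet")
df = DataFrame.from_arrow(tbl)
df.run()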

DataFrame Expressions

DataFrame expressions allow you to transform and explore your data through a variety of operations. For example, given a DataFrame of values with IDs, you could group the values by ID and compute the aggregate sum per group. We use the underscore notation _ to reference the current DataFrame, which lets you access columns without any additional wrappers.

>>> df.run()
DataFrame(materialized 5 rows x 2 columns)
Schema: id (int64), value (int64)
Showing all rows:
id (int64) | value (int64)
-----------+--------------
1          | 10
1          | 20
2          | 1
2          | 2
2          | 3
>>> grouped = df.agg(["id"], _.value.sum().alias("value_sum_by_id"))
>>> grouped.run()
DataFrame(materialized 2 rows x 2 columns)
Schema: id (int64), value_sum_by_id (int64)
Showing all rows:
id (int64) | value_sum_by_id (int64)
-----------+------------------------
1          | 30
2          | 6

Below are some common operations you can use to construct DataFrame expressions.

select

DataFrame.select allows you to extract specific columns from a DataFrame.

>>> interactions = df.select("user_id", "item_id", "interaction_type", "timestamp")
>>> interactions.run()
DataFrame(materialized 5 rows x 4 columns)
Schema: user_id (string), item_id (string), interaction_type (string), timestamp (timestamp[us])
Showing all rows:
user_id (string) | item_id (string) | interaction_type (string) | timestamp (timestamp[us])
-----------------+------------------+---------------------------+--------------------------
user_001         | item_4521        | click                     | 2024-11-08 14:23:15
user_002         | item_8832        | purchase                  | 2024-11-08 15:10:42
user_003         | item_1203        | view                      | 2024-11-08 16:05:30
user_004         | item_4521        | add_to_cart               | 2024-11-08 17:45:12
user_005         | item_9944        | click                     | 2024-11-08 18:20:05

with_columns

To add new columns or modify existing columns in your DataFrame while retaining all original columns, you can use DataFrame.with_columns.

>>> processed_interactions = df.with_columns({"user_id": _.user_id, "long_session": _.session_duration_sec > 180})
>>> processed_interactions.select("user_id", "session_duration_sec", "long_session").run()
DataFrame(materialized 5 rows x 3 columns)
Schema: user_id (string), session_duration_sec (int64), long_session (bool)
Showing all rows:
user_id (string) | session_duration_sec (int64) | long_session (bool)
-----------------+------------------------------+--------------------
user_001         | 145                          | False
user_002         | 320                          | True
user_003         | 78                           | False
user_004         | 210                          | True
user_005         | 167                          | False

project

To define more complex transformations without retaining all original columns, you can use DataFrame.project, which takes a DataFrame as input and projects to new columns based on the provided expressions. Within these expressions, you can use the underscore notation _ to reference columns in your source DataFrame, and draw on the chalk.functions library for a variety of operations.

import pyarrow as pa

import chalk.functions as F
from chalkdf import DataFrame

tbl = pa.table(
  {
    "txns_last_hour": [[1, 2, 3, 4, 5], [100], [200, 201]],
    "max_txns_allowed": [3, 5, 4],
  }
)
df = DataFrame.from_arrow(tbl)
out = df.project(
  {
    # Count recent txns, cap the count at max_txns_allowed + 0.5, round up:
    "velocity_score": _.txns_last_hour
      .cardinality()
      .cast(float)
      .least(_.max_txns_allowed + 0.5)
      .ceil()
      .cast(int),
    # The same computation written with chalk.functions instead of chaining:
    "velocity_score_2": F.cast(
      F.ceil(
        F.least(
          F.cast(F.cardinality(_.txns_last_hour), float),
          _.max_txns_allowed + 0.5
        )
      ),
      int,
    ),
  }
)
out.run()
DataFrame(materialized 3 rows x 2 columns)
Schema: velocity_score (int64), velocity_score_2 (int64)
Showing all rows:
velocity_score (int64) | velocity_score_2 (int64)
-----------------------+-------------------------
4                      | 4
1                      | 1
2                      | 2

filter

You can filter rows in a DataFrame using DataFrame.filter, which takes in a boolean expression.

>>> df.run()
DataFrame(materialized 5 rows x 7 columns)
Schema: user_id (string), item_id (string), interaction_type (string), timestamp (timestamp[us]), score (double), category (string), session_duration_sec (int64)
Showing all rows:
user_id (string) | item_id (string) | interaction_type (string) | timestamp (timestamp[us]) | score (double) | category (string) | session_duration_sec (int64)
-----------------+------------------+---------------------------+---------------------------+----------------+-------------------+-----------------------------
user_001         | item_4521        | click                     | 2024-11-08 14:23:15       | 0.85           | electronics       | 145
user_002         | item_8832        | purchase                  | 2024-11-08 15:10:42       | 0.92           | fashion           | 320
user_003         | item_1203        | view                      | 2024-11-08 16:05:30       | 0.67           | home              | 78
user_004         | item_4521        | add_to_cart               | 2024-11-08 17:45:12       | 0.78           | electronics       | 210
user_005         | item_9944        | click                     | 2024-11-08 18:20:05       | 0.81           | sports            | 167
>>> df.filter(_.score > 0.8).run()
DataFrame(materialized 3 rows x 7 columns)
Schema: user_id (string), item_id (string), interaction_type (string), timestamp (timestamp[us]), score (double), category (string), session_duration_sec (int64)
Showing all rows:
user_id (string) | item_id (string) | interaction_type (string) | timestamp (timestamp[us]) | score (double) | category (string) | session_duration_sec (int64)
-----------------+------------------+---------------------------+---------------------------+----------------+-------------------+-----------------------------
user_001         | item_4521        | click                     | 2024-11-08 14:23:15       | 0.85           | electronics       | 145
user_002         | item_8832        | purchase                  | 2024-11-08 15:10:42       | 0.92           | fashion           | 320
user_005         | item_9944        | click                     | 2024-11-08 18:20:05       | 0.81           | sports            | 167
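
Filter expressions can reference multiple columns. As a sketch, assuming boolean expressions combine with the & and | operators as in other expression-based DataFrame libraries:

# Hypothetical compound filter: high-score electronics interactions.
# Assumes & combines boolean expressions elementwise.
df.filter((_.score > 0.8) & (_.category == "electronics")).run()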

agg

To compute aggregations over groups of data, you can use DataFrame.agg.

>>> processed_interactions.agg(["long_session"], processed_interactions.column("score").mean().alias("avg_score")).run()
DataFrame(materialized 2 rows x 2 columns)
Schema: long_session (bool), avg_score (double)
Showing all rows:
long_session (bool) | avg_score (double)
--------------------+-------------------
False               | 0.776667
True                | 0.85
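
The first argument to agg is the list of grouping keys, so you can group by more than one column. As a sketch, assuming agg also accepts multiple aggregation expressions after the key list:

# Hypothetical: group by category and compute two aggregates at once.
# Assumes agg accepts several expressions after the grouping-key list.
by_category = df.agg(
    ["category"],
    _.score.mean().alias("avg_score"),
    _.session_duration_sec.max().alias("max_session_sec"),
)
by_category.run()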

Working with Multiple DataFrames

join

You can combine data from multiple DataFrames using DataFrame.join. Below is an example of joining two DataFrames with an inner join on the user_id column. You can also specify left and right joins via the how parameter.

>>> txns_df.join(
...     users_df,
...     on=["user_id"],
...     how="inner"
... ).select(
...     "transaction_id",
...     "user_id",
...     "name",
...     "amount",
...     "tier",
...     "status"
... ).run()
DataFrame(materialized 5 rows x 6 columns)
Schema: transaction_id (string), user_id (string), name (string), amount (double), tier (string), status (string)
Showing all rows:
transaction_id (string) | user_id (string) | name (string) | amount (double) | tier (string) | status (string)
------------------------+------------------+---------------+-----------------+---------------+----------------
txn_101                 | user_001         | Alice         | 49.99           | premium       | completed
txn_102                 | user_002         | Bob           | 19.99           | basic         | completed
txn_103                 | user_001         | Alice         | 89.5            | premium       | pending
txn_104                 | user_003         | Charlie       | 120             | premium       | completed
txn_105                 | user_001         | Alice         | 15.75           | premium       | completed
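
To keep every transaction even when no matching user exists, you could use a left join instead. A sketch, mirroring the how="inner" argument above (the exact how value for left joins is an assumption here):

# Hypothetical left join: keeps all rows from txns_df, filling user
# columns with nulls where no user_id matches.
txns_df.join(users_df, on=["user_id"], how="left").run()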