DataFrame
Install and run your first chalkdf commands
This guide will walk you through installing chalkdf and performing some basic operations.
If you are new to chalkdf or DataFrame libraries, this is a great place to start. If you
are already familiar with DataFrame libraries, you can skip ahead to the full
Reference Library or the installation guide.
With Python 3.10-3.13 on Linux or macOS 15+, you can install chalkdf via pip:
pip install "chalkdf[chalkpy]"chalkdf supports reading data from several formats and sources, including JSON, CSV, and Parquet files,
Arrow objects and tables, as well as AWS Glue Iceberg tables, and SQL sources defined via
chalkpy.
Below, we initialize a chalkdf.DataFrame from a Python dictionary.
from chalkdf import DataFrame
df = DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "city": ["New York", "Los Angeles", "Chicago"],
    }
)
df.run()
DataFrame(materialized 3 rows x 3 columns)
Schema: name (string), age (int64), city (string)
Showing all rows:
name (string) | age (int64) | city (string)
--------------+-------------+--------------
Alice         | 25          | New York
Bob           | 30          | Los Angeles
Charlie       | 35          | Chicago
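chalkdf can also work with in-memory Arrow data. For example, you can wrap a pyarrow.Table in a DataFrame with
DataFrame.from_arrow, the same constructor used in the projection example later in this guide; the table contents
here are illustrative:
from chalkdf import DataFrame
import pyarrow as pa

# Build an in-memory Arrow table and wrap it in a chalkdf DataFrame.
tbl = pa.table(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
    }
)
df = DataFrame.from_arrow(tbl)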
You can also scan files directly into a DataFrame. Below, we scan a CSV file:
from chalkdf import DataFrame
import pyarrow as pa
df = DataFrame.scan(
    "people",
    ["people.csv"],
    schema=pa.schema([
        ("name", pa.string()),
        ("birthday", pa.date32()),
        ("occupation", pa.string()),
    ]),
)
df.run()
DataFrame(materialized 3 rows x 3 columns)
Schema: name (string), birthday (date32[day]), occupation (string)
Showing all rows:
name (string) | birthday (date32[day]) | occupation (string)
--------------+------------------------+--------------------
Alice Chen    | 1990-03-15             | Software Engineer
Bob Smith     | 1985-07-22             | Teacher
Carol Johnson | 1992-11-08             | Data Analyst
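Other formats follow the same pattern. As a minimal sketch, and assuming DataFrame.scan accepts Parquet file paths
the same way it accepts CSV paths, scanning a hypothetical people.parquet file with the same schema might look like
this:
from chalkdf import DataFrame
import pyarrow as pa

# Hypothetical Parquet file; assumes scan handles Parquet paths like CSV paths.
df = DataFrame.scan(
    "people",
    ["people.parquet"],
    schema=pa.schema([
        ("name", pa.string()),
        ("birthday", pa.date32()),
        ("occupation", pa.string()),
    ]),
)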
DataFrame expressions allow you to transform and explore your data through a variety of operations.
For example, given a DataFrame of values with ids, you could group the values by id and compute the
aggregate sum. The underscore notation _ references the current DataFrame, so you can
access columns directly without any additional wrappers.
>>> df.run()
DataFrame(materialized 5 rows x 2 columns)
Schema: id (int64), value (int64)
Showing all rows:
id (int64) | value (int64)
-----------+--------------
1          | 10
1          | 20
2          | 1
2          | 2
2          | 3
>>> grouped = df.agg(["id"], _.value.sum().alias("value_sum_by_id"))
>>> grouped.run()
DataFrame(materialized 2 rows x 2 columns)
Schema: id (int64), value_sum_by_id (int64)
Showing all rows:
id (int64) | value_sum_by_id (int64)
-----------+------------------------
1          | 30
2          | 6
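If DataFrame.agg accepts more than one aggregation expression after the list of grouping keys (an assumption not
confirmed by the example above), the same grouped query could compute several aggregates at once; the value_sum and
value_mean names here are illustrative:
# Sketch only: assumes agg takes multiple aggregation expressions after the keys.
summary = df.agg(
    ["id"],
    _.value.sum().alias("value_sum"),
    _.value.mean().alias("value_mean"),
)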
Below are some common operations you can use to construct DataFrame expressions.
DataFrame.select allows you to extract specific columns from a DataFrame.
>>> interactions = df.select("user_id", "item_id", "interaction_type", "timestamp")
>>> interactions.run()
DataFrame(materialized 5 rows x 4 columns)
Schema: user_id (string), item_id (string), interaction_type (string), timestamp (timestamp[us])
Showing all rows:
user_id (string) | item_id (string) | interaction_type (string) | timestamp (timestamp[us])
-----------------+------------------+---------------------------+--------------------------
user_001         | item_4521        | click                     | 2024-11-08 14:23:15
user_002         | item_8832        | purchase                  | 2024-11-08 15:10:42
user_003         | item_1203        | view                      | 2024-11-08 16:05:30
user_004         | item_4521        | add_to_cart               | 2024-11-08 17:45:12
user_005         | item_9944        | click                     | 2024-11-08 18:20:05
To add new columns or modify existing columns in your DataFrame while retaining all original
columns, you can use DataFrame.with_columns.
>>> processed_interactions = df.with_columns({"user_id": _.user_id, "long_session": _.session_duration_sec > 180})
>>> processed_interactions.select("user_id", "session_duration_sec", "long_session").run()
DataFrame(materialized 5 rows x 3 columns)
Schema: user_id (string), session_duration_sec (int64), long_session (bool)
Showing all rows:
user_id (string) | session_duration_sec (int64) | long_session (bool)
-----------------+------------------------------+--------------------
user_001         | 145                          | False
user_002         | 320                          | True
user_003         | 78                           | False
user_004         | 210                          | True
user_005         | 167                          | False
To define more complex transformations without retaining all original columns, you can use DataFrame.project,
which projects your DataFrame onto new columns defined by the provided expressions. In these expressions, you can use
the underscore notation _ to reference columns within your source DataFrame, and the chalk.functions
library (referenced as F below) for a variety of operations. The example below computes the same value twice: once by
chaining expression methods, and once with the equivalent chalk.functions calls.
import pyarrow as pa

import chalk.functions as F  # the chalk.functions library, referenced as F below
from chalkdf import DataFrame

tbl = pa.table(
    {
        "txns_last_hour": [[1, 2, 3, 4, 5], [100], [200, 201]],
        "max_txns_allowed": [3, 5, 4],
    }
)
df = DataFrame.from_arrow(tbl)
out = df.project(
    {
        # Chained expression methods:
        "velocity_score": _.txns_last_hour
            .cardinality()
            .cast(float)
            .least(_.max_txns_allowed + 0.5)
            .ceil()
            .cast(int),
        # The same computation expressed with chalk.functions:
        "velocity_score_2": F.cast(
            F.ceil(
                F.least(
                    F.cast(F.cardinality(_.txns_last_hour), float),
                    _.max_txns_allowed + 0.5,
                )
            ),
            int,
        ),
    }
)
out.run()
DataFrame(materialized 3 rows x 2 columns)
Schema: velocity_score (int64), velocity_score_2 (int64)
Showing all rows:
velocity_score (int64) | velocity_score_2 (int64)
-----------------------+-------------------------
4                      | 4
1                      | 1
2                      | 2
You can filter rows in a DataFrame using DataFrame.filter, which takes in a boolean expression.
>>> df.run()
DataFrame(materialized 5 rows x 7 columns)
Schema: user_id (string), item_id (string), interaction_type (string), timestamp (timestamp[us]), score (double), category (string), session_duration_sec (int64)
Showing all rows:
user_id (string) | item_id (string) | interaction_type (string) | timestamp (timestamp[us]) | score (double) | category (string) | session_duration_sec (int64)
-----------------+------------------+---------------------------+---------------------------+----------------+-------------------+-----------------------------
user_001         | item_4521        | click                     | 2024-11-08 14:23:15       | 0.85           | electronics       | 145
user_002         | item_8832        | purchase                  | 2024-11-08 15:10:42       | 0.92           | fashion           | 320
user_003         | item_1203        | view                      | 2024-11-08 16:05:30       | 0.67           | home              | 78
user_004         | item_4521        | add_to_cart               | 2024-11-08 17:45:12       | 0.78           | electronics       | 210
user_005         | item_9944        | click                     | 2024-11-08 18:20:05       | 0.81           | sports            | 167
>>> df.filter(_.score > 0.8).run()
DataFrame(materialized 3 rows x 7 columns)
Schema: user_id (string), item_id (string), interaction_type (string), timestamp (timestamp[us]), score (double), category (string), session_duration_sec (int64)
Showing all rows:
user_id (string) | item_id (string) | interaction_type (string) | timestamp (timestamp[us]) | score (double) | category (string) | session_duration_sec (int64)
-----------------+------------------+---------------------------+---------------------------+----------------+-------------------+-----------------------------
user_001         | item_4521        | click                     | 2024-11-08 14:23:15       | 0.85           | electronics       | 145
user_002         | item_8832        | purchase                  | 2024-11-08 15:10:42       | 0.92           | fashion           | 320
user_005         | item_9944        | click                     | 2024-11-08 18:20:05       | 0.81           | sports            | 167
To compute aggregations over groups of data, you can use DataFrame.agg.
>>> processed_interactions.agg(["long_session"], processed_interactions.column("score").mean().alias("avg_score")).run()
DataFrame(materialized 2 rows x 2 columns)
Schema: long_session (bool), avg_score (double)
Showing all rows:
long_session (bool) | avg_score (double)
--------------------+-------------------
False               | 0.776667
True                | 0.85
You can combine data from multiple DataFrames using DataFrame.join. Below is an example of joining two
DataFrames using an inner join on the user_id column. You can also specify right and left joins.
>>> txns_df.join(
...     users_df,
...     on=["user_id"],
...     how="inner"
... ).select(
...     "transaction_id",
...     "user_id",
...     "name",
...     "amount",
...     "tier",
...     "status"
... ).run()
DataFrame(materialized 5 rows x 6 columns)
Schema: transaction_id (string), user_id (string), name (string), amount (double), tier (string), status (string)
Showing all rows:
transaction_id (string) | user_id (string) | name (string) | amount (double) | tier (string) | status (string)
------------------------+------------------+---------------+-----------------+---------------+----------------
txn_101                 | user_001         | Alice         | 49.99           | premium       | completed
txn_102                 | user_002         | Bob           | 19.99           | basic         | completed
txn_103                 | user_001         | Alice         | 89.5            | premium       | pending
txn_104                 | user_003         | Charlie       | 120             | premium       | completed
txn_105                 | user_001         | Alice         | 15.75           | premium       | completed
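As a minimal sketch of a non-inner join, assuming "left" is the accepted value for the how argument, a left join
would keep every row of txns_df even when no matching user exists in users_df:
joined = txns_df.join(
    users_df,
    on=["user_id"],
    how="left",  # assumed value; keeps all transactions, matched or not
).select("transaction_id", "user_id", "name", "amount")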