Feature Engine
Leverage pre-materialized aggregate tiles to efficiently build training sets for windowed features.
Building training sets for windowed aggregation features is one of the most expensive parts of the offline ML pipeline. A naive point-in-time join has to scan every underlying event per spine row, which scales poorly as either your historical spine or your event table grows.
Chalk can re-use the same materialized aggregation tiles that back online serving to make offline training set construction dramatically cheaper. By binning historical events into fixed-width partial-aggregate buckets once, Chalk can then answer arbitrarily many windowed queries by merging a small number of pre-computed tiles per row — rather than re-scanning the raw events for every row of the training spine.
When a feature is defined with a materialization config, Chalk maintains a schedule that
periodically backfills partial aggregate state into object storage — one row per
(grouping_key, bucket_start) pair, storing the minimal state required to reconstitute the
aggregation (e.g. running sum and count for a mean).
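The tile layout described above can be sketched in plain Python. This is an illustrative model of the storage shape, not Chalk's actual implementation: one partial-aggregate row per `(grouping_key, bucket_start)` pair, holding just enough state (here, running sum and count) to reconstitute a mean later. The 1-day `bucket_duration` is an assumed example value.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(days=1)  # assumed bucket_duration for illustration
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bucket_start(ts: datetime) -> datetime:
    """Floor a timestamp to its fixed-width bucket boundary."""
    return EPOCH + int((ts - EPOCH) / BUCKET) * BUCKET

# One row of partial state per (grouping_key, bucket_start) pair.
tiles = defaultdict(lambda: {"sum": 0.0, "count": 0})

def ingest(user_id: str, ts: datetime, amount: float) -> None:
    """Fold a raw event into its bucket's partial aggregate."""
    state = tiles[(user_id, bucket_start(ts))]
    state["sum"] += amount
    state["count"] += 1
```

Two events landing anywhere inside the same day update the same tile, so the raw events never need to be kept around to answer windowed queries.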
At offline query time, you can tell the planner to read directly from these tiles rather than re-aggregating the raw event table:
```python
from chalk.client import ChalkClient

client = ChalkClient()
ds = client.offline_query(
    inputs={"user.id": big_input_list},
    input_times=big_input_time_list,
    output=[
        User.total_transaction_amount["7d"],
        User.total_transaction_amount["30d"],
        User.mean_transaction_amount["7d"],
    ],
    planner_options={"use_materialized_offline_query": True},
)
```

With use_materialized_offline_query=True, Chalk bins each spine row’s input_time to a
bucket boundary, loads the partial aggregate tiles that overlap each requested window, and
merges them to produce the final aggregation.
Because the underlying tiles are window-agnostic, adding "14d" to a feature that previously
declared "7d" and "30d" is almost free — Chalk merges a different subset of the
already-materialized tiles. No new scan of the raw event table is required.
Partial aggregate state often contains more information than the feature you originally
exposed. For example, if you have materialized sum and count for a column, you can
compute mean by merging those two tile streams — again without re-reading any raw events.
Consider computing merchant.count_transactions_last_7d on a historical spine where
transaction.timestamp has millions of events per merchant per day. A naive backfill has to,
for each spine (merchant, timestamp) pair, scan every transaction within the preceding
seven days — a pattern that scales roughly quadratically on heavy-hitter merchants,
since both the number of spine rows and the number of events per window grow with activity.
Pre-materialized tiles flatten this. Each merchant has at most
window / bucket_duration tiles overlapping any given 7-day window, so the work per spine
row becomes constant, and the backfill scales linearly in the number of bucket points
rather than in the product of spine rows and events per window.
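A back-of-envelope cost model makes the asymptotics above concrete. The numbers are invented for illustration; the point is the ratio between events scanned naively and tiles merged.

```python
# Illustrative workload for one heavy-hitter merchant.
spine_rows = 10_000        # (merchant, timestamp) pairs in the training spine
events_per_day = 1_000_000 # raw transactions per day
window_days = 7
bucket_days = 1            # assumed bucket_duration

# Naive point-in-time join: every spine row re-scans its full 7-day window.
naive_events_scanned = spine_rows * events_per_day * window_days

# Tile-based plan: each spine row merges ~window / bucket_duration tiles.
tiles_merged = spine_rows * (window_days // bucket_days)
```

Here the naive plan touches 70 billion event rows while the tile plan merges 70,000 tiles, a factor of one million, and the gap widens as per-merchant activity grows.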
Tiles are aligned on fixed bucket boundaries (see bucket_duration in
MaterializationWindowConfig).
When an offline query requests a window that does not fall exactly on a bucket boundary,
Chalk must decide how to treat the partial bucket at the leading or trailing edge of the
window. With use_materialized_offline_query=True, Chalk aggregates over all complete
buckets that intersect the window interval, which means:

- The effective window may differ slightly from the window_interval requested.
- Two spine rows whose input_time falls in the same bucket will see identical aggregate values.

For most training-set use cases this is an acceptable trade-off given the order-of-magnitude
improvement in backfill time, but you should pick a bucket_duration that is small relative
to your window intervals when accuracy matters. If you need the offline computation to
exactly mirror the rounding behavior of online serving, also set
align_offline_chalk_window_with_materialization=True.
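The bucket rounding described above can be sketched as follows. This is a simplified model under the assumption that the query time is floored to a bucket boundary and the window extends back from there; it is not Chalk's exact planner logic.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def effective_window(
    input_time: datetime, window: timedelta, bucket: timedelta
) -> tuple[datetime, datetime]:
    """Snap input_time down to a bucket boundary, then look back one window.

    Any two input_times inside the same bucket produce the same interval,
    which is why same-bucket spine rows see identical aggregate values.
    """
    floored = EPOCH + int((input_time - EPOCH) / bucket) * bucket
    return floored - window, floored
```

With a 1-day bucket, a 7-day window requested at 09:30 and at 23:00 on the same day both resolve to the interval ending at that day's midnight boundary.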
Shorter bucket durations yield more accurate offline results at the cost of more storage and more tiles to merge at query time. For most production windowed features, a bucket duration between 1% and 10% of the shortest window interval is a good starting point.
`use_materialized_offline_query` requires that a materialization schedule has been configured for the feature and that sufficient historical tiles have been backfilled to cover the requested `input_times`. If tiles are missing for part of the requested range, Chalk will fall back to running resolvers for that range.
- `chalk aggregate backfill` — CLI for triggering a backfill of aggregate tiles.