Feature Engine
Leverage pre-materialized aggregate tiles to efficiently build training sets for windowed features.
Building training sets for windowed aggregation features is one of the most expensive parts of the offline ML pipeline. A naive point-in-time join has to scan every underlying event per spine row, which scales poorly as either your historical spine or your event table grows.
Chalk can re-use the same materialized aggregation tiles that back online serving to make offline training set construction dramatically cheaper. By binning historical events into fixed-width partial-aggregate buckets once, Chalk can then answer arbitrarily many windowed queries by merging a small number of pre-computed tiles per row — rather than re-scanning the raw events for every row of the training spine.
When a feature is defined with a materialization config, Chalk maintains a schedule that
periodically backfills partial aggregate state into object storage — one row per
(grouping_key, bucket_start) pair, storing the minimal state required to reconstitute the
aggregation (e.g. running sum and count for a mean).
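The tile layout described above can be sketched in plain Python. This is an illustrative model of the storage shape, not Chalk's actual implementation: one partial-aggregate row per `(grouping_key, bucket_start)` pair, holding just enough state (here, running sum and count) to reconstitute a mean later. The 1-day `bucket_duration` is an assumed example value.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(days=1)  # assumed bucket_duration for illustration
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def bucket_start(ts: datetime) -> datetime:
    """Floor a timestamp to its fixed-width bucket boundary."""
    return EPOCH + int((ts - EPOCH) / BUCKET) * BUCKET

# One row of partial state per (grouping_key, bucket_start) pair.
tiles = defaultdict(lambda: {"sum": 0.0, "count": 0})

def ingest(user_id: str, ts: datetime, amount: float) -> None:
    """Fold a raw event into its bucket's partial aggregate."""
    state = tiles[(user_id, bucket_start(ts))]
    state["sum"] += amount
    state["count"] += 1
```

Two events landing anywhere inside the same day update the same tile, so the raw events never need to be kept around to answer windowed queries.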
At offline query time, you can tell the planner to read directly from these tiles rather than re-aggregating the raw event table:
```python
from chalk.client import ChalkClient

client = ChalkClient()
ds = client.offline_query(
    inputs={"user.id": big_input_list},
    input_times=big_input_time_list,
    output=[
        User.total_transaction_amount["7d"],
        User.total_transaction_amount["30d"],
        User.mean_transaction_amount["7d"],
    ],
    planner_options={"use_materialized_offline_query": True},
)
```

With use_materialized_offline_query=True, Chalk bins each spine row’s input_time to a
bucket boundary, loads the partial aggregate tiles that overlap each requested window, and
merges them to produce the final aggregation.
Because the underlying tiles are window-agnostic, adding "14d" to a feature that previously
declared "7d" and "30d" is almost free — Chalk merges a different subset of the
already-materialized tiles. No new scan of the raw event table is required.
Partial aggregate state often contains more information than the feature you originally
exposed. For example, if you have materialized sum and count for a column, you can
compute mean by merging those two tile streams — again without re-reading any raw events.
Consider computing merchant.count_transactions_last_7d on a historical spine where
transaction.timestamp has millions of events per merchant per day. A naive backfill has to,
for each spine (merchant, timestamp) pair, scan every transaction within the preceding
seven days — a pattern that scales roughly quadratically on heavy-hitter merchants,
since both the number of spine rows and the number of events per window grow with activity.
Pre-materialized tiles flatten this. Each merchant has at most
window / bucket_duration tiles overlapping any given 7-day window, so the work per spine
row becomes constant, and the backfill scales linearly in the number of bucket points
rather than in the product of spine rows and events per window.
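A back-of-envelope cost model makes the asymptotics above concrete. The numbers are invented for illustration; the point is the ratio between events scanned naively and tiles merged.

```python
# Illustrative workload for one heavy-hitter merchant.
spine_rows = 10_000        # (merchant, timestamp) pairs in the training spine
events_per_day = 1_000_000 # raw transactions per day
window_days = 7
bucket_days = 1            # assumed bucket_duration

# Naive point-in-time join: every spine row re-scans its full 7-day window.
naive_events_scanned = spine_rows * events_per_day * window_days

# Tile-based plan: each spine row merges ~window / bucket_duration tiles.
tiles_merged = spine_rows * (window_days // bucket_days)
```

Here the naive plan touches 70 billion event rows while the tile plan merges 70,000 tiles, a factor of one million, and the gap widens as per-merchant activity grows.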
Tiles are aligned on fixed bucket boundaries (see bucket_duration in
MaterializationWindowConfig).
When an offline query requests a window that does not fall exactly on a bucket boundary,
Chalk must decide how to treat the partial bucket at the leading or trailing edge of the
window. With use_materialized_offline_query=True, Chalk aggregates over all complete
buckets that intersect the window interval, which means:

- The effective window may differ slightly from the window_interval requested.
- Two spine rows whose input_time falls in the same bucket will see identical aggregate values.

For most training-set use cases this is an acceptable trade-off given the order-of-magnitude
improvement in backfill time, but you should pick a bucket_duration that is small relative
to your window intervals when accuracy matters. If you need the offline computation to
exactly mirror the rounding behavior of online serving, also set
align_offline_chalk_window_with_materialization=True.
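The bucket rounding described above can be sketched as follows. This is a simplified model under the assumption that the query time is floored to a bucket boundary and the window extends back from there; it is not Chalk's exact planner logic.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def effective_window(
    input_time: datetime, window: timedelta, bucket: timedelta
) -> tuple[datetime, datetime]:
    """Snap input_time down to a bucket boundary, then look back one window.

    Any two input_times inside the same bucket produce the same interval,
    which is why same-bucket spine rows see identical aggregate values.
    """
    floored = EPOCH + int((input_time - EPOCH) / bucket) * bucket
    return floored - window, floored
```

With a 1-day bucket, a 7-day window requested at 09:30 and at 23:00 on the same day both resolve to the interval ending at that day's midnight boundary.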
Shorter bucket durations yield more accurate offline results at the cost of more storage and more tiles to merge at query time. For most production windowed features, a bucket duration between 1% and 10% of the shortest window interval is a good starting point.
`use_materialized_offline_query` requires that a materialization schedule has been configured for the feature and that sufficient historical tiles have been backfilled to cover the requested `input_times`. If tiles are missing for part of the requested range, Chalk will fall back to running resolvers for that range.
- `chalk aggregate backfill` — CLI for triggering a backfill of aggregate tiles.