Windowed features

Windowed features are features defined over time ranges. For example, you can use windowed features to count the number of login attempts made by a user over the past 10 minutes, or to track the largest purchase amount a cardholder has made in the past 30 days.

Feature definition

Here is an example of a windowed feature representing the number of failed logins in the last 10 minutes, 30 minutes, and 1 day:

from chalk.features import features, windowed, Windowed

@features
class User:
    id: int
    num_failed_logins: Windowed[int] = windowed(
        "10m",
        "30m",
        "1d",
        max_staleness="10m",
        default=0,
        owner="trust-and-safety",
    )

Windowed features support much of the same functionality as a normal feature. They are most often used alongside max_staleness and etl_offline_to_online, which let the computed window values be sent to the online and offline stores after each window period. Windowed features often use default to set the value returned when there are no events within a time window.

Referencing windowed features

A windowed feature can be referenced in a query or a resolver in the following, equivalent ways. Each column below shows the possible syntax variants for a given time window.

# Note: the final form in each column expresses the window duration in seconds
User.num_failed_logins("10m")    User.num_failed_logins("1d")       User.num_failed_logins("1h30m")
User.num_failed_logins["10m"]    User.num_failed_logins["1d"]       User.num_failed_logins["1h30m"]
User.num_failed_logins_10m       User.num_failed_logins_1d          User.num_failed_logins_1h30m
User.num_failed_logins__10m__    User.num_failed_logins__1d__       User.num_failed_logins__1h30m__
User.num_failed_logins__600__    User.num_failed_logins__86400__    User.num_failed_logins__5400__
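
The seconds-based form in the last row can be derived mechanically from the duration string. As an illustration (this parser is not part of the Chalk API), a minimal sketch:

```python
import re

# Maps duration-string units to seconds. Illustrative only.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def window_to_seconds(window: str) -> int:
    """Convert a duration string such as "1h30m" to total seconds."""
    total = 0
    # Each match is a (number, unit) pair, e.g. ("1", "h"), ("30", "m").
    for amount, unit in re.findall(r"(\d+)([smhdw])", window):
        total += int(amount) * _UNITS[unit]
    return total
```

For example, window_to_seconds("10m") yields 600 and window_to_seconds("1h30m") yields 5400, matching the __600__ and __5400__ names above.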

Windowed features can be inputs to resolvers:

from chalk import online

@online
def account_under_attack(
    failed_logins_30m: User.num_failed_logins('30m'),
    failed_logins_1d: User.num_failed_logins('1d')
) -> ...:
    return failed_logins_30m > 10 or failed_logins_1d > 100

Grouping windowed features

Similar to SQL GROUP BY clauses, you can group your windowed feature by one or more other features with group_by_windowed.

Here’s an example where we track a cardholder’s spend in the last 30 and 90 days, grouped by mcc (merchant category code):

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    spend_by_category: DataFrame = group_by_windowed(
        "30d",
        "90d",
        materialization={
            "bucket_duration": "1d",
        },
        expression=_.transactions.group_by(_.mcc).agg(_.amount.sum()),
    )
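
To make the semantics concrete, here is a plain-Python sketch of what the grouped window above computes; the transaction fields (ts, mcc, amount) are illustrative assumptions, not Chalk's schema:

```python
from collections import defaultdict

def spend_by_category(transactions, now, window_seconds):
    """Sum transaction amounts per merchant category over a trailing window."""
    cutoff = now - window_seconds
    totals = defaultdict(float)
    for txn in transactions:
        # Only transactions inside the trailing window contribute.
        if txn["ts"] >= cutoff:
            totals[txn["mcc"]] += txn["amount"]
    return dict(totals)
```

Each distinct mcc value gets its own running aggregate, just as each group in a SQL GROUP BY gets its own row.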

Windowed aggregation for performance

Windowed features are typically computed from either raw data or pre-aggregated data. Raw data is the most accurate, but can be slow to query over long time windows or large volumes of data. Some systems improve performance by serving features from pre-aggregated batch data, which reduces the number of data points that must be read, but prevents your application from seeing the newest data entering your system.

Chalk balances accuracy and performance by combining both approaches. We aggregate historical data while continuously updating as new data arrives. To have Chalk aggregate your data, pass materialization to your windowed feature and use bucket_duration to set the size of each aggregated time window:

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    total_transaction_amount: Windowed[int] = windowed(
        "10d",
        "90d",
        materialization={"bucket_duration": "1d"},
        expression=_.transactions[_.amount].sum(),
    )

Because this code sets a bucket_duration of 1 day, Chalk will aggregate transaction data into 1-day buckets of timeseries data. Chalk will then use this timeseries data to serve total_transaction_amount for the past 10 days and 90 days. As new data arrives, the relevant timeseries buckets are updated to include it. Over time, the oldest buckets are removed from your online store once they cannot be used by any time window.
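
The update-and-serve cycle described above can be sketched in plain Python (a toy model, not the Chalk runtime): new events land in epoch-aligned buckets, and a window is served by summing the buckets it covers:

```python
DAY = 86400  # seconds in a 1-day bucket

# In-memory stand-in for the timeseries store: bucket index -> running sum.
buckets: dict[int, float] = {}

def record(ts: int, amount: float, bucket_seconds: int = DAY) -> None:
    """Add a new event to its epoch-aligned bucket."""
    idx = ts // bucket_seconds
    buckets[idx] = buckets.get(idx, 0.0) + amount

def serve(window_seconds: int, now: int, bucket_seconds: int = DAY) -> float:
    """Serve a windowed sum from the pre-aggregated buckets."""
    first = (now - window_seconds) // bucket_seconds
    return sum(v for idx, v in buckets.items() if idx >= first)
```

With 1-day buckets, serving a 90-day sum touches at most 90 bucket values rather than every raw transaction.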

The number of buckets is determined by your longest time window divided by your bucket duration. For example, if your time window is 90 days and your bucket duration is 1 day, you would have 90 buckets. If your bucket duration is set to 1 minute, you would instead have 129,600 buckets.

Buckets are aligned starting from the Unix epoch, ignoring leap seconds. To serve windowed feature queries, Chalk uses all buckets containing any overlap with the requested time window.
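The bucket arithmetic above can be sketched as follows (illustrative only; Chalk's internals may differ):

```python
DAY = 86400  # seconds per day

def bucket_count(window_seconds: int, bucket_seconds: int) -> int:
    """Longest time window divided by bucket duration, rounded up."""
    return -(-window_seconds // bucket_seconds)

def overlapping_buckets(query_start: int, query_end: int,
                        bucket_seconds: int) -> list[int]:
    """All epoch-aligned buckets with any overlap with [query_start, query_end)."""
    first = query_start // bucket_seconds
    last = (query_end - 1) // bucket_seconds
    return list(range(first, last + 1))
```

For a 90-day window, bucket_count(90 * DAY, DAY) is 90, while bucket_count(90 * DAY, 60) with 1-minute buckets is 129,600, which is why a very small bucket_duration inflates storage and serving cost.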

Managing aggregations

Use chalk aggregate backfill to backfill the aggregations for a windowed feature. This command is useful if you change your feature's time windows or bucket_duration.

To view existing aggregations, use chalk aggregate list.

Each of these commands outputs a table of your aggregations:

 Series  Namespace    Group                Agg     Bucket  Retention  Aggregation  Dependent Features
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 1       transaction  user_id merchant_id  amount  1d      30d        sum          user.txn_sum_by_merchant merchant.txn_sum_by_user
 1       transaction  user_id merchant_id  amount  1d      30d        count        user.txn_count_by_merchant
 2       transaction  user_id              amount  1d      30d        sum          user.txn_sum

The series column shows the unique ID of the timeseries data underlying our aggregation system. Each unique combination of namespace, group (see group_by_windowed), and agg (value to aggregate) columns represents a separate timeseries. When possible, Chalk will use the same timeseries data to serve multiple features.

Bucket shows the current bucket size. Retention shows the maximum time window of any feature that depends on the given timeseries. Dependent features lists the features that are served by the given timeseries.

Windowed streaming

To learn more about using windowed features with streaming data sources, see our documentation on windowed streaming.