Features
Set max staleness to introduce caching.
When a feature is expensive or slow to compute, you may wish to cache its value in the online store. Chalk uses the term "maximum staleness" to describe how recently a feature value must have been computed for the value in the online store to be returned, rather than recomputing a fresh feature value by running a resolver. The online store is typically best suited for low-latency reads on a smaller amount of data relative to the offline store.
You can specify the maximum staleness for a feature as follows:
```python
from datetime import timedelta

from chalk.features import feature, features


@features
class User:
    # Using text descriptors:
    expensive_fraud_score: float = feature(
        max_staleness="1m 30s"
    )

    # Alternatively, using timedelta:
    expensive_fraud_score: float = feature(
        max_staleness=timedelta(minutes=1, seconds=30)
    )
```
Max staleness durations can be given in natural language, or specified using datetime.timedelta. You can specify a max staleness of “infinity” to indicate that Chalk should cache computed feature values forever. This makes sense for data that never becomes invalid, or for data that you wish to explicitly update using Streaming Updates or Reverse ETL.
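As a sketch of the "infinity" case: a value that never becomes invalid once known can be cached forever (the `date_of_birth` feature below is hypothetical, chosen only to illustrate data that does not expire):

```python
from chalk.features import feature, features


@features
class User:
    # A date of birth never becomes stale once computed, so it is safe to
    # cache it in the online store indefinitely.
    date_of_birth: str = feature(max_staleness="infinity")
```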
Staleness can also be assigned to all features in a namespace:
```python
@features(max_staleness="1d")
class User:
    fraud_score: float
    full_name: str
    email: str = feature(max_staleness="0s")
    ...
```
Here, `User.fraud_score` and `User.full_name` assume the max staleness of `1d`. However, `User.email`, which specifies max staleness at the feature level, assumes the max staleness of `0s`, forcing it to be recomputed on every request.
By default, features are not cached, and are instead recomputed for every online request. In effect, you can think of `max_staleness` as being `0` except where otherwise specified.
Once you have set the max staleness for a feature, there are several ways to populate the online store, depending on whether you just want to cache recently computed feature values or want to ensure that your queries take the low-latency path of an online store lookup:
- If you run an offline query with `store_online=True`, the feature values computed in the offline query output will be loaded into the online store.
- If you trigger a resolver run with `store_online=True`, the online store will be populated with the feature values computed in the resolver run.
- If you run a scheduled query with `store_online=True`, the online store will be populated with the feature values computed in the scheduled query.
- If you have a dataset, you can load its values into the online store with `dataset.ingest(store_online=True)`.
- You can set the `etl_offline_to_online` parameter to `True` in an `@online` or `@offline` scheduled resolver to populate the online store with the feature values computed in the resolver for features with `max_staleness != 0`.
- Streaming updates write to the online store for features with `max_staleness != 0`.
For features with `max_staleness != 0`, you can also specify how you want to handle null and default feature values. By default, Chalk caches the computed feature value even if it is null or the default value. However, if you set the `cache_null` or `cache_default` parameter to `False`, Chalk will not cache the null or default computed feature value. Furthermore, if your online store is Redis or DynamoDB, you can also set `cache_nulls="evict_nulls"` to evict cached null feature values from the online store, and `cache_defaults="evict_defaults"` to evict cached default feature values from the online store.
Read more about how to handle null and default value caching here.
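As a sketch of how caching and null handling combine (assuming the `cache_null` parameter name used above; `email_domain` is a hypothetical feature):

```python
from chalk.features import feature, features


@features
class User:
    id: int
    # Cache computed values for a day, but do not cache nulls, so that a
    # resolver gets another chance to compute a real value on the next request.
    email_domain: str = feature(max_staleness="1d", cache_null=False)
```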
Given all of these options, which recipe should you follow when deciding what data to cache in the online store and how to load it? Here are some common use cases and how to handle them:
- Run an offline query with `store_online=True`, or ingest a dataset to the online store using `dataset.ingest(store_online=True)`.
- Trigger a resolver run with `store_online=True`, which will populate the online store with all possible feature values for the resolver's output features.
- Set `etl_offline_to_online` in your resolvers: if you have resolvers that compute features with `max_staleness != 0`, you can set `etl_offline_to_online=True` to populate the online store with the feature values computed in the resolver run.
- Use streaming updates to keep the online store fresh for features with `max_staleness != 0`.
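The resolver-level flag might look like the following sketch (the cron schedule, resolver body, and `fraud_score` feature are hypothetical; only the `etl_offline_to_online` parameter is the point here):

```python
from chalk import offline
from chalk.features import feature, features


@features
class User:
    id: int
    fraud_score: float = feature(max_staleness="1d")


# Because fraud_score has max_staleness != 0, values computed by this
# scheduled offline resolver are also loaded into the online store.
@offline(cron="1d", etl_offline_to_online=True)
def get_fraud_score(uid: User.id) -> User.fraud_score:
    ...
```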
The `max_staleness` values provided to the `feature` function may be overridden at the time of querying for features. See Overriding Default Caching for a detailed discussion.
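A query-time override might look like the following sketch (assuming a deployed Chalk environment and a `User.fraud_score` feature with caching enabled; the input values are illustrative):

```python
from chalk.client import ChalkClient

client = ChalkClient()

# Override max staleness to zero for this request only, bypassing the cached
# value and forcing fraud_score to be recomputed by its resolver.
client.query(
    input={"user.id": 1},
    output=["user.fraud_score"],
    staleness={"user.fraud_score": "0s"},
)
```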
Chalk enables granular configuration of whether and how to store feature values in the online and offline stores. For each feature class, the `@features` decorator can be used to specify whether the scalar features in the feature class should be cached in the online store, based on the `max_staleness` parameter. However, the `feature` function enables per-feature specification of whether values computed for that feature within a feature class should be stored in the online and offline stores, using the `store_online` and `store_offline` parameters. The following example showcases how these parameters can be used in combination to express different storage behavior.
```python
from datetime import datetime

from chalk.features import DataFrame, feature, features, has_many


# Due to the max_staleness parameter, all scalar features in the Driver feature
# class will be cached in the online store with a max staleness of 30 days. In
# addition, all of the computed feature values will be stored offline by default.
@features(max_staleness="30d")
class Driver:
    id: int
    # With no overrides, this feature will be stored online with a max
    # staleness of 30 days, and stored offline.
    name: str
    # With an override on store_offline, this feature will be stored online
    # with a max staleness of 30 days, but will not be stored offline.
    age: int = feature(store_offline=False)
    # With overrides for store_online and store_offline, computed feature
    # values for location will not be persisted in either the online or
    # offline store.
    location: str = feature(store_online=False, store_offline=False)
    # Joined features are not scalar features, and hence do not inherit the
    # storage behavior of the feature class. Because neither the Job feature
    # class nor the join specifies a max staleness, this join is not cached.
    jobs: "DataFrame[Job]"
    # Because the Record feature class has a max staleness of 30 days, and so
    # does the join, this join would be cached on the Driver feature class.
    records: "DataFrame[Record]" = has_many(
        lambda: Record.driver_id == Driver.id, max_staleness="30d"
    )


# With no max_staleness set, features in the Job feature class will by default
# be stored offline, but not online.
@features
class Job:
    id: int
    # With no overrides, this feature will be stored offline, but not online.
    driver_id: Driver.id
    # With an override for store_offline, computed feature values for
    # start_time will not be persisted.
    start_time: datetime = feature(store_offline=False)
    # With overrides for store_online and store_offline, computed feature
    # values for end_time will be persisted in the online store, but not the
    # offline store.
    end_time: datetime = feature(store_online=True, store_offline=False)


# With max_staleness set, all scalar features in the Record feature class will
# be cached in the online store with a max staleness of 30 days and stored
# offline by default.
@features(max_staleness="30d")
class Record:
    id: int
    # With an override on max_staleness, all other features in the feature
    # class are cached with a max staleness of 30 days, but driver_id is
    # cached with a max staleness of 1 day.
    driver_id: Driver.id = feature(max_staleness="1d")
    timestamp: datetime
    record_details: str
```