Features
Set max staleness to introduce caching.
When a feature is expensive or slow to compute, you may wish to cache its value in the online store. Chalk uses the terminology “maximum staleness” to describe how recently a feature value needs to have been computed for the value in the online store to be returned rather than recomputing a fresh feature value by running a resolver. The online store is typically best suited for low-latency reads on a smaller amount of data relative to the offline store.
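To make the rule concrete, here is a minimal sketch of the read-path decision in plain Python. It is not Chalk's implementation: `CachedValue`, `read_feature`, and the `recompute` callback are hypothetical names used only for illustration, and treating a value whose age exactly equals the maximum staleness as stale is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional, Tuple


@dataclass
class CachedValue:
    """Hypothetical online-store entry: a value plus the time it was computed."""
    value: float
    computed_at: datetime


def read_feature(
    cached: Optional[CachedValue],
    max_staleness: timedelta,
    recompute: Callable[[], float],
    now: datetime,
) -> Tuple[float, bool]:
    """Return (value, served_from_cache)."""
    # Assumption: strict comparison, so a max staleness of zero never serves
    # from the cache, matching "recomputed on every request".
    if cached is not None and now - cached.computed_at < max_staleness:
        return cached.value, True
    # Missing or too old: run the (stand-in) resolver for a fresh value.
    return recompute(), False
```

With a max staleness of 90 seconds, a value computed 30 seconds ago is returned from the cache, while one computed five minutes ago triggers the resolver.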
You can specify the maximum staleness for a feature as follows:
from chalk.features import feature, features
from datetime import timedelta

@features
class User:
    # Using text descriptors:
    expensive_fraud_score: float = feature(
        max_staleness="1m 30s"
    )

    # Alternatively, using timedelta:
    expensive_fraud_score: float = feature(
        max_staleness=timedelta(minutes=1, seconds=30)
    )
Max staleness durations can be given in natural language, or specified using datetime.timedelta. You can specify a max staleness of “infinity” to indicate that Chalk should cache computed feature values forever. This makes sense for data that never becomes invalid, or for data that you wish to explicitly update using Streaming Updates or Reverse ETL.
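As an illustration of how the two spellings line up, the sketch below converts a text descriptor into a `timedelta`. The grammar Chalk accepts is not specified here, so this parser is an assumption: it handles only the `d`, `h`, `m`, and `s` units plus the literal `"infinity"`, and `parse_staleness` is a hypothetical helper, not part of the Chalk API.

```python
import re
from datetime import timedelta

# Assumed unit suffixes; the real descriptor grammar may accept more.
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
_TOKEN = re.compile(r"(\d+)\s*([dhms])")


def parse_staleness(text: str) -> timedelta:
    """Convert a descriptor such as "1m 30s" into a timedelta."""
    text = text.strip().lower()
    if text == "infinity":
        # Stand-in for "cache forever".
        return timedelta.max
    total = timedelta()
    pos = 0
    for match in _TOKEN.finditer(text):
        # Reject anything other than whitespace between tokens.
        if text[pos:match.start()].strip():
            raise ValueError(f"unrecognized duration: {text!r}")
        amount, unit = match.groups()
        total += timedelta(**{_UNITS[unit]: int(amount)})
        pos = match.end()
    # Reject empty input and trailing junk.
    if pos == 0 or text[pos:].strip():
        raise ValueError(f"unrecognized duration: {text!r}")
    return total
```

Under these assumptions, `parse_staleness("1m 30s")` equals `timedelta(minutes=1, seconds=30)`, which is the equivalence the two declarations above rely on.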
Staleness can also be assigned to all features in a namespace:
@features(max_staleness="1d")
class User:
    fraud_score: float
    full_name: str
    email: str = feature(max_staleness="0s")
    ...
Here, User.fraud_score and User.full_name assume the max-staleness of 1d. However, User.email, which specifies max-staleness at the feature level, assumes the max-staleness of 0s, forcing it to be recomputed on every request.
By default, features are not cached, and instead are recomputed for every online request. In effect, you can think of max_staleness as being 0 except where otherwise specified.
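The precedence described above (feature level, then namespace level, then the implicit default of zero) can be sketched as a small lookup. `effective_max_staleness` is a hypothetical helper written for this explanation; it models the rule and does not call into Chalk.

```python
from datetime import timedelta
from typing import Dict, Optional

NOT_CACHED = timedelta(0)  # the implicit default: recompute on every request


def effective_max_staleness(
    feature_level: Optional[timedelta],
    namespace_level: Optional[timedelta],
) -> timedelta:
    """Pick the most specific max staleness that was declared."""
    # Compare against None rather than truthiness: an explicit 0s at the
    # feature level is a real setting and must win over the namespace value.
    if feature_level is not None:
        return feature_level
    if namespace_level is not None:
        return namespace_level
    return NOT_CACHED


# Mirrors the User example: a namespace default of 1d, with email pinned to 0s.
namespace_default = timedelta(days=1)
declared: Dict[str, Optional[timedelta]] = {
    "fraud_score": None,
    "full_name": None,
    "email": timedelta(0),
}
resolved = {
    name: effective_max_staleness(override, namespace_default)
    for name, override in declared.items()
}
```

Here `resolved["fraud_score"]` and `resolved["full_name"]` come out as one day, while `resolved["email"]` stays at zero.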
Once you have set the max staleness for a feature, there are several ways to populate the online store, depending on whether you want to just cache recently computed feature values or if you want to ensure that your queries utilize the low-latency path of online store lookup.
- If you run an offline query with store_online=True, the feature values computed in the offline query output will be loaded into the online store.
- Run a resolver with store_online=True to populate the online store with the feature values computed in the resolver run.
- Run a scheduled query with store_online=True to populate the online store with the feature values computed in the scheduled query.
- Ingest a dataset with dataset.ingest(store_online=True).
- Set the etl_offline_to_online parameter to True in an @online or @offline scheduled resolver to populate the online store with the feature values computed in the resolver for features with max_staleness != 0.
- [...] max_staleness != 0.
For features with max_staleness != 0, you can also specify how you want to handle null and default feature values. By default, Chalk caches the computed feature value even if it is null or the default value. If you set the cache_nulls or cache_defaults parameter to False, Chalk will not cache the null or default computed feature value. Furthermore, if your online store is Redis or DynamoDB, you can also set cache_nulls="evict_nulls" to evict cached null feature values from the online store, and cache_defaults="evict_defaults" to evict cached default feature values from the online store.
Read more about how to handle null and default value caching here.
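The write-side rules for null and default values can be sketched as a single predicate. This is a simplified model in plain Python, not Chalk's code: `should_cache` and `write_through` are hypothetical helpers, the online store is a plain dict, and only the `True`/`False` settings are modeled. The `"evict_nulls"` and `"evict_defaults"` modes are left out because their exact behavior is not spelled out in this section.

```python
from datetime import timedelta
from typing import Any, Dict

_MISSING = object()  # sentinel: the feature declares no default value


def should_cache(
    value: Any,
    *,
    max_staleness: timedelta,
    default: Any = _MISSING,
    cache_nulls: bool = True,
    cache_defaults: bool = True,
) -> bool:
    """Decide whether a freshly computed value is written to the online store."""
    if max_staleness == timedelta(0):
        # Caching is off for this feature.
        return False
    if value is None and not cache_nulls:
        return False
    if default is not _MISSING and value == default and not cache_defaults:
        return False
    return True


def write_through(store: Dict[str, Any], key: str, value: Any, **policy: Any) -> bool:
    """Write the value to the dict-backed store if the policy allows it."""
    if should_cache(value, **policy):
        store[key] = value
        return True
    return False
```

With the defaults, a computed `None` is stored like any other value; passing `cache_nulls=False` leaves the store untouched for that computation, so the next request runs the resolver again.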
Given all of these options, which recipe should you follow for what data to cache in the online store and how to load that data into the online store?
First, there are a few general principles to keep in mind:
Keeping these principles in mind, here are some common use cases and how to handle them:
- Run an offline query with store_online=True, or ingest a dataset to the online store using dataset.ingest(store_online=True).
- A resolver run with store_online=True will populate the online store with all possible feature values for the resolver’s output features.
- etl_offline_to_online in your resolvers: If you have resolvers that compute features with max_staleness != 0, you can set etl_offline_to_online=True to populate the online store with the feature values computed in the resolver run.
- [...] max_staleness != 0.
The max_staleness values provided to the feature function may be overridden at the time of querying for features. See Overriding Default Caching for a detailed discussion.
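As a rough model of what a query-time override means, the sketch below lets a per-query mapping take precedence over the declared values. The actual query parameter and its format are covered in Overriding Default Caching; `staleness_for_query` and both mappings are hypothetical and only illustrate the precedence.

```python
from datetime import timedelta
from typing import Mapping, Optional


def staleness_for_query(
    feature_name: str,
    declared: Mapping[str, timedelta],
    query_overrides: Optional[Mapping[str, timedelta]] = None,
) -> timedelta:
    """Return the max staleness that applies to one feature in one query."""
    # A per-query override wins over whatever the feature declared.
    if query_overrides is not None and feature_name in query_overrides:
        return query_overrides[feature_name]
    # Otherwise use the declared value, falling back to "not cached".
    return declared.get(feature_name, timedelta(0))
```

For example, a feature declared with one day of staleness could be read with a one-minute tolerance in a latency-insensitive query, without changing the feature definition.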