Features
Set max staleness to introduce caching.
When a feature is expensive or slow to compute, you may wish to cache its value in the online store. Chalk uses the term "maximum staleness" to describe how recently a feature value must have been computed for the value in the online store to be returned, rather than recomputing a fresh feature value by running a resolver. The online store is typically best suited for low-latency reads on a smaller amount of data relative to the offline store.
You can specify the maximum staleness for a feature as follows:
```python
from datetime import timedelta

from chalk.features import feature, features


@features
class User:
    # Using text descriptors:
    expensive_fraud_score: float = feature(
        max_staleness="1m 30s"
    )

    # Alternatively, using timedelta:
    expensive_fraud_score: float = feature(
        max_staleness=timedelta(minutes=1, seconds=30)
    )
```
Max staleness durations can be given in natural language, or specified using datetime.timedelta. You can specify a max staleness of “infinity” to indicate that Chalk should cache computed feature values forever. This makes sense for data that never becomes invalid, or for data that you wish to explicitly update using Streaming Updates or Reverse ETL.
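As a sketch of the "infinity" case: a value that never becomes invalid once known can be cached forever (the `date_of_birth` feature below is hypothetical, chosen only to illustrate data that does not expire):

```python
from chalk.features import feature, features


@features
class User:
    # A date of birth never becomes stale once computed, so it is safe to
    # cache it in the online store indefinitely.
    date_of_birth: str = feature(max_staleness="infinity")
```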
Staleness can also be assigned to all features in a namespace:
```python
@features(max_staleness="1d")
class User:
    fraud_score: float
    full_name: str
    email: str = feature(max_staleness="0s")
    ...
```
Here, `User.fraud_score` and `User.full_name` assume the max staleness of `1d`. However, `User.email`, which specifies max staleness at the feature level, assumes the max staleness of `0s`, forcing it to be recomputed on every request.
By default, features are not cached, and are instead recomputed for every online request. In effect, you can think of `max_staleness` as being `0` except where otherwise specified.
Once you have set the max staleness for a feature, there are several ways to populate the online store, depending on whether you just want to cache recently computed feature values or want to ensure that your queries take the low-latency path of an online store lookup:
- If you run an offline query with `store_online=True`, the feature values computed in the offline query output will be loaded into the online store.
- If you trigger a resolver run with `store_online=True`, the online store will be populated with the feature values computed in the resolver run.
- If you run a scheduled query with `store_online=True`, the online store will be populated with the feature values computed in the scheduled query.
- If you have a dataset, you can load its values into the online store with `dataset.ingest(store_online=True)`.
- You can set the `etl_offline_to_online` parameter to `True` in an `@online` or `@offline` scheduled resolver to populate the online store with the feature values computed in the resolver for features with `max_staleness != 0`.
- Streaming updates write to the online store for features with `max_staleness != 0`.
For features with `max_staleness != 0`, you can also specify how you want to handle null and default feature values. By default, Chalk caches the computed feature value even if it is null or the default value. However, if you set the `cache_null` or `cache_default` parameter to `False`, Chalk will not cache the null or default computed feature value. Furthermore, if your online store is Redis or DynamoDB, you can also set `cache_nulls="evict_nulls"` to evict cached null feature values from the online store, and `cache_defaults="evict_defaults"` to evict cached default feature values from the online store.
Read more about how to handle null and default value caching here.
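As a sketch of how caching and null handling combine (assuming the `cache_null` parameter name used above; `email_domain` is a hypothetical feature):

```python
from chalk.features import feature, features


@features
class User:
    id: int
    # Cache computed values for a day, but do not cache nulls, so that a
    # resolver gets another chance to compute a real value on the next request.
    email_domain: str = feature(max_staleness="1d", cache_null=False)
```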
Given all of these options, which recipe should you follow when deciding what data to cache in the online store and how to load it? Here are some common use cases and how to handle them:
- Run an offline query with `store_online=True`, or ingest a dataset to the online store using `dataset.ingest(store_online=True)`.
- Trigger a resolver run with `store_online=True`, which will populate the online store with all possible feature values for the resolver's output features.
- Set `etl_offline_to_online` in your resolvers: if you have resolvers that compute features with `max_staleness != 0`, you can set `etl_offline_to_online=True` to populate the online store with the feature values computed in the resolver run.
- Use streaming updates to keep the online store fresh for features with `max_staleness != 0`.
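The resolver-level flag might look like the following sketch (the cron schedule, resolver body, and `fraud_score` feature are hypothetical; only the `etl_offline_to_online` parameter is the point here):

```python
from chalk import offline
from chalk.features import feature, features


@features
class User:
    id: int
    fraud_score: float = feature(max_staleness="1d")


# Because fraud_score has max_staleness != 0, values computed by this
# scheduled offline resolver are also loaded into the online store.
@offline(cron="1d", etl_offline_to_online=True)
def get_fraud_score(uid: User.id) -> User.fraud_score:
    ...
```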
The `max_staleness` values provided to the `feature` function may be overridden at the time of querying for features. See Overriding Default Caching for a detailed discussion.
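A query-time override might look like the following sketch (assuming a deployed Chalk environment and a `User.fraud_score` feature with caching enabled; the input values are illustrative):

```python
from chalk.client import ChalkClient

client = ChalkClient()

# Override max staleness to zero for this request only, bypassing the cached
# value and forcing fraud_score to be recomputed by its resolver.
client.query(
    input={"user.id": 1},
    output=["user.fraud_score"],
    staleness={"user.fraud_score": "0s"},
)
```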
Chalk enables granular configuration of whether and how to store feature values in the online and offline stores. For each feature class, the `@features` decorator can be used to specify whether the scalar features in the feature class should be cached in the online store, based on the `max_staleness` parameter. However, the `feature` function enables per-feature specification of whether values computed for that feature within a feature class should be stored in the online and offline stores, using the `store_online` and `store_offline` parameters. The following example showcases how these parameters can be used in combination to express different storage behavior.
```python
from datetime import datetime

from chalk.features import DataFrame, feature, features, has_many


# Due to the max_staleness parameter, all scalar features in the Driver feature
# class will be cached in the online store with a max staleness of 30 days. In
# addition, all of the computed feature values will be stored offline by default.
@features(max_staleness="30d")
class Driver:
    id: int
    # With no overrides, this feature will be stored online with a max
    # staleness of 30 days, and stored offline.
    name: str
    # With an override on store_offline, this feature will be stored online
    # with a max staleness of 30 days, but will not be stored offline.
    age: int = feature(store_offline=False)
    # With overrides for store_online and store_offline, computed feature
    # values for location will not be persisted in either the online or
    # offline store.
    location: str = feature(store_online=False, store_offline=False)
    # Joined features are not scalar features, and hence do not inherit the
    # storage behavior of the feature class. Because neither the Job feature
    # class nor the join specifies a max staleness, this join is not cached.
    jobs: "DataFrame[Job]"
    # Because the Record feature class has a max staleness of 30 days, and so
    # does the join, this join would be cached on the Driver feature class.
    records: "DataFrame[Record]" = has_many(
        lambda: Record.driver_id == Driver.id, max_staleness="30d"
    )


# With no max_staleness set, features in the Job feature class will by default
# be stored offline, but not online.
@features
class Job:
    id: int
    # With no overrides, this feature will be stored offline, but not online.
    driver_id: Driver.id
    # With an override for store_offline, computed feature values for
    # start_time will not be persisted.
    start_time: datetime = feature(store_offline=False)
    # With overrides for store_online and store_offline, computed feature
    # values for end_time will be persisted in the online store, but not the
    # offline store.
    end_time: datetime = feature(store_online=True, store_offline=False)


# With max_staleness set, all scalar features in the Record feature class will
# be cached in the online store with a max staleness of 30 days and stored
# offline by default.
@features(max_staleness="30d")
class Record:
    id: int
    # With an override on max_staleness, all other features in the feature
    # class are cached with a max staleness of 30 days, but driver_id is
    # cached with a max staleness of 1 day.
    driver_id: Driver.id = feature(max_staleness="1d")
    timestamp: datetime
    record_details: str
```