Resolvers
Eliminate train-serve skew with shared logic.
Chalk supports two primary execution contexts: online (for inference) and offline (for training). Online contexts handle real-time prediction requests with millisecond latency requirements, while offline contexts process historical data for model training and analysis.
Chalk enables you to write resolvers that can run in either or both contexts, helping eliminate train-serve skew while optimizing for the different performance characteristics of each environment.
Online resolvers are eligible to run in both online and offline queries. However, in an offline context, Chalk prefers an offline resolver over an online resolver if both are available for the same feature.
Offline resolvers can run only in offline queries, and can never be used as part of an online query.
There are three types of resolvers: expressions, Python resolvers, and SQL resolvers. Expressions are online resolvers, while Python and SQL resolvers can be specified as either offline or online.
Chalk expressions are online resolvers. However, an expression can operate on data that is fetched from either an online or offline resolver. Expressions are written in terms of the data model, and thus are portable between online and offline contexts.
For example, consider the following feature definition:
```python
from chalk import _
from chalk.features import features


@features
class User:
    id: int
    first_name: str
    last_name: str
    full_name: str = _.first_name + " " + _.last_name
```
The `full_name` feature is defined as an expression that concatenates the `first_name` and `last_name` features. No matter how we compute `first_name` and `last_name`, the expression for `full_name` remains the same.
If we have a PostgreSQL online resolver that computes `first_name` and `last_name` from a low-latency database, we can use that resolver in an online query. If we have a Snowflake offline resolver that computes `first_name` and `last_name` from a data warehouse, we can use that resolver in an offline query.
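As a sketch, that pair might be written as two SQL file resolvers like the following (the `users` table name is assumed; the SQL resolver format is covered in more detail below):

```sql
-- resolves: User
-- source: postgres
-- type: online
select id, first_name, last_name
from users
```

```sql
-- resolves: User
-- source: snowflake
-- type: offline
select id, first_name, last_name
from users
```

In either context, the `full_name` expression is evaluated on top of whichever resolver supplied `first_name` and `last_name`.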
Scalar expressions like this one are also time-independent: so long as the underlying features are temporally consistent, the expression will be as well. If your expression depends on time-varying data, such as a `DataFrame` aggregation, you can use `_.chalk_now` or `_.chalk_window` (in the case of a windowed aggregation) to ensure temporal consistency.
```python
from datetime import datetime, timedelta

from chalk import _
from chalk.features import DataFrame, Windowed, features, windowed


@features
class Transaction:
    id: int
    user_id: "User.id"
    amount: float
    at: datetime


@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    total_amount_last_30d: float = _.transactions[
        _.at >= _.chalk_now - timedelta(days=30),
        _.amount,
    ].sum()
    average_transaction_amount: Windowed[float] = windowed(
        "1d", "30d", "365d",
        expression=_.transactions[_.amount, _.at > _.chalk_window].mean(),
    )
```
Now, `total_amount_last_30d` and `average_transaction_amount` will compute the same values in both online and offline contexts, regardless of how the `transactions` for the `User` are fetched. At inference time, `_.chalk_now` will be the current time, and `_.chalk_window` will be the time of the query less the window size. At training time, `_.chalk_now` will be the point-in-time of the query, and `_.chalk_window` will be the point-in-time of the query less the window size.
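To make this concrete, here is a sketch of a training query that samples `total_amount_last_30d` at historical timestamps; it assumes your client version supports the `input_times` parameter of `ChalkClient.offline_query`:

```python
from datetime import datetime, timezone

from chalk.client import ChalkClient
from src.features import User

client = ChalkClient()

# Each input row is evaluated as of its timestamp, so _.chalk_now
# resolves to that point-in-time rather than the wall-clock time.
dataset = client.offline_query(
    input={User.id: [1, 2]},
    input_times=[
        datetime(2024, 1, 1, tzinfo=timezone.utc),
        datetime(2024, 6, 1, tzinfo=timezone.utc),
    ],
    output=[User.total_amount_last_30d],
)
```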
To define a Python resolver as online or offline, use the `@online` or `@offline` decorator.
```python
from chalk.features import DataFrame, offline, online
from src.features import Transaction, User


@online
def get_email_username(email: User.email) -> User.email_username:
    username = email.split("@")[0]
    if "gmail.com" in email:
        # Gmail ignores dots and any "+suffix" in the local part.
        username = username.split("+")[0].replace(".", "")
    return username.lower()


@offline
def get_transactions() -> DataFrame[
    Transaction.id,
    Transaction.amount,
    Transaction.date,
]:
    return DataFrame.read_parquet(...)
```
Online Python resolvers can run in both online and offline queries, while offline Python resolvers can run only in offline queries. Typically, almost all Python resolvers will be online. The most common use-case for offline Python resolvers is to load parquet files or other data files that represent historical data.
If your online Python resolver is time-dependent, you can pull in the point-in-time of the query to ensure temporal consistency.
```python
from chalk import Now, online
from src.features import User


@online
def get_age(birthdate: User.birthdate, now: Now) -> User.age:
    # "now" is the wall-clock time online and the point-in-time
    # of the query offline.
    return (now.date() - birthdate).days // 365
```
At inference time, `chalk.Now` will be the current time, and at training time, `chalk.Now` will be the point-in-time of the query.
To define a SQL resolver as online or offline, use the `type` field in the resolver comment.
```sql
-- resolves: User
-- source: postgres
-- type: online
select
    id,
    name,
    age
from users
```
With SQL resolvers, you're likely to want to define both an online and an offline variant. You would never want to run a query against, for example, Snowflake in an online context: Snowflake, like other data warehouses such as BigQuery, Databricks, and Redshift, is not designed for low-latency queries. Instead, in an online context you would want to query a low-latency database like Postgres or MySQL.
In contrast, with an offline query, you might want to run a query against Snowflake to get a large amount of data, and you might not have a Postgres instance that contains all the data you need or can handle unloading terabytes of data.
```sql
-- resolves: User
-- source: snowflake
-- type: offline
select
    id,
    name,
    first_name,
    last_name,
    age
from users
```
```sql
-- resolves: User
-- source: postgres
-- type: online
select
    id,
    name,
    age
from users
```
Note that these two resolvers can have different schemas: the offline Snowflake resolver above returns `first_name` and `last_name` in addition to `name`, while the online Postgres resolver returns only `name`.
Introducing separate online and offline resolvers opens the door to computing the same feature in different ways. Some skew from separate online and offline SQL resolvers is unavoidable: data warehouses cannot stand up to production request volumes, and transactional databases cannot store the nearly unlimited data volumes at which warehouses excel.
However, we should aim to minimize this skew as much as possible. For example, consider these aggregation features for user transaction statistics:
```python
from chalk.features import features


@features
class User:
    id: int
    count: int
    avg_amount: float
    amount_stddev: float
```
The two SQL resolvers below look nearly identical, but they produce subtly different results due to differences in how Postgres and Snowflake handle standard deviation calculations and rounding. `STDDEV_POP()` in Postgres computes the population standard deviation, while Snowflake's `STDDEV()` defaults to the sample standard deviation. Additionally, the rounding methods (`::DECIMAL` vs. `ROUND()`) can produce different results for edge cases. These small numerical differences can compound and degrade model performance in production.
Implementation for Postgres online context
```sql
-- resolves: User
-- source: postgres
-- type: online
select
    user_id as id,
    count(*) as count,
    avg(amount)::decimal(10,2) as avg_amount,
    stddev_pop(amount)::decimal(10,2) as amount_stddev
from transactions
where created_at >= current_date - interval '30 days'
group by user_id
```
Now you need to implement the same logic again for the offline context, adapted to Snowflake's SQL dialect. Even though the intent is identical, the implementation details differ: different function names, different date arithmetic syntax, and different rounding behavior.
Separate implementation for Snowflake offline context
```sql
-- resolves: User
-- source: snowflake
-- type: offline
select
    user_id as id,
    count(*) as count,
    round(avg(amount), 2) as avg_amount,
    round(stddev(amount), 2) as amount_stddev
from transactions
where created_at >= dateadd('day', -30, current_date())
group by user_id
```
Instead of computing these aggregations differently in each database, you can fetch the raw transaction data and compute the statistics with Chalk expressions. This ensures the exact same computation logic runs in both online and offline contexts, eliminating this source of train-serve skew. The expression syntax is simple and declarative: you specify what you want to compute, and Chalk executes it optimally for each environment.
Using expressions to compute derived features consistently
```python
from chalk import _
from chalk.features import DataFrame, features


@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    count: int = _.transactions.count()
    avg_amount: float = _.transactions[_.amount].mean()
    amount_stddev: float = _.transactions[_.amount].std()
```
Chalk includes an online and an offline store. The online store is used to cache data to make realtime data requests extremely fast, while the offline store is a data warehouse that stores all the features that you’ve computed—enabling monitoring and dataset generation for training.
Your online store holds data that you want to cache according to the cache policies you set on your features or on your queries. At inference time, you need exceptionally fast reads to serve data at low latency. However, it’s not important to keep every value of every historical feature value around for low-latency access. This access pattern aligns with the performance characteristics of Redis or DynamoDB, which Chalk uses to store your online data.
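For example, a caching policy can be declared directly on a feature with `max_staleness`. A minimal sketch, assuming a one-day staleness tolerance:

```python
from chalk.features import feature, features


@features
class User:
    id: int
    # Written to the online store when computed; online reads may
    # serve a cached value up to one day old.
    name: str = feature(max_staleness="1d")
```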
The offline store holds far more data than the online store. It keeps a record of all online runs and indexes all data brought in from your offline data sources. Chalk integrates with a number of different data warehouse systems for our large-scale offline storage depending on customer needs and deployment type, including BigQuery, Snowflake, and Redshift.
In addition, offline queries write their output to a parquet file in cloud storage (S3/GCS), whereas online queries write their results to the online and offline store databases.
| writes to the offline store | writes to the online store |
|---|---|
| an online query writes all freshly computed features (those not read from the online store) | an online query writes all freshly computed features with `max_staleness != 0` |
| a triggered resolver run with `store_offline=True` (default behavior) | a triggered resolver run with `store_online=True` (default behavior) |
| scheduled queries with `recompute_features=True` and `store_offline=True` (default behavior) | scheduled queries with `store_online=True` (default behavior) |
| ingesting a dataset to the offline store: `dataset.ingest(store_offline=True)` | ingesting a dataset to the online store: `dataset.ingest(store_online=True)` |
| offline queries with `ChalkClient.offline_query(store_offline=True)` | offline queries with `ChalkClient.offline_query(store_online=True)` |
| an `@online` or `@offline` scheduled resolver | an `@online` or `@offline` scheduled resolver that computes features with `max_staleness != 0` and `etl_offline_to_online=True` |
| a streaming resolver: `@stream` | a streaming resolver: `@stream` |
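As a sketch of a few of these write paths, using the flags named in the table above (defaults may differ by deployment):

```python
from chalk.client import ChalkClient
from src.features import User

client = ChalkClient()

# Persist offline query results to the offline store only.
dataset = client.offline_query(
    input={User.id: [1, 2, 3]},
    output=[User.avg_amount],
    store_offline=True,
    store_online=False,
)

# Later, push the computed dataset into the online store for
# low-latency reads.
dataset.ingest(store_online=True, store_offline=False)
```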
Every request you make to Chalk for data is done through a query, and every query you make is either an online or an offline query.
Online queries retrieve information about a single entity. For example, you might be computing the features of a credit model for a single user, or deciding which products to suggest to a customer. Online queries are therefore designed to be as quick as possible: they return within milliseconds. You can use our API client to run queries.
Offline queries sample historical data about many entities at specific points in time, for model training or investigation. When you execute an offline query, Chalk kicks off a job that acquires the requested data for every primary key/timestamp combination presented. This could take a few seconds! Since offline queries often look up data for thousands of rows, they are not designed for millisecond-level decisions. See our guide on offline queries for a more in-depth treatment.
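Both query types go through the same client. A minimal sketch, reusing the `User` features defined earlier:

```python
from chalk.client import ChalkClient
from src.features import User

client = ChalkClient()

# Online query: one entity, designed for millisecond latency.
result = client.query(
    input={User.id: 1},
    output=[User.full_name],
)

# Offline query: many entities, returns a dataset for training
# or analysis.
dataset = client.offline_query(
    input={User.id: [1, 2, 3]},
    output=[User.full_name],
)
```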
| | online query | offline query |
|---|---|---|
| online resolver | `@online` resolver will run | `@online` resolver will run if there is no `@offline` resolver with the same definition |
| offline resolver | `@offline` resolver will never run | `@offline` resolver will run |
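For example, if both of the following resolvers are deployed (a sketch; the `email_domain` feature is hypothetical), an online query can only run the `@online` variant, while an offline query prefers the `@offline` one:

```python
from chalk.features import offline, online
from src.features import User


@online
def email_domain_online(email: User.email) -> User.email_domain:
    # Eligible in both online and offline queries.
    return email.split("@")[1]


@offline
def email_domain_offline(email: User.email) -> User.email_domain:
    # Eligible only in offline queries, where it takes precedence
    # over the @online variant for the same feature.
    return email.split("@")[1]
```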
After an online resolver runs, its values are copied into the offline store. When you query the offline store, you will receive data from both records of online runs and offline-specific resolvers. Which data you receive depends on which data was closest to the point-in-time that you queried. For more information, see temporal consistency.
In contrast, data from the offline store does not reach the online store by default. However, you can choose to ETL the data from an offline resolver into the online store. This can be helpful, for example, when you tolerate stale data in online inference and have a data source in the offline store that doesn’t have a direct replacement in the online store. More details are provided in the section Reverse ETL.
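A sketch of that opt-in: set `etl_offline_to_online` (together with a `max_staleness` tolerance) on the feature. The `fico_score` feature here is hypothetical:

```python
from chalk.features import feature, features


@features
class User:
    id: int
    # Values computed by offline resolvers are copied into the online
    # store, and online reads may serve values up to 30 days stale.
    fico_score: int = feature(
        max_staleness="30d",
        etl_offline_to_online=True,
    )
```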
| Online query | Offline query |
|---|---|
| Runs only `@online` resolvers | Runs both `@online` and `@offline` resolvers |
| Returns one row of data about one entity | Returns a `DataFrame` of many rows of historical data corresponding to multiple entities at historical time points |
| Designed to return data in milliseconds | Blocks until computation is complete; not designed for millisecond-level computation |
| Queries the online store and calls `@online` resolvers for quick retrieval | Queries the offline store, which stores all data from online queries, unless `recompute_features=True`, in which case `@offline` and `@online` resolvers are used to resolve the outputs |
| Writes output data to the online store and offline store databases | Writes output to a parquet file in cloud storage; only writes to the online or offline store if specified |