
Chalk supports online and offline resolvers. Online resolvers produce features for online inference, and offline resolvers produce offline training data.

Storage

Online and offline resolvers have different read/write needs. Your online environment stores only the data you choose to cache, for as long as you choose to cache it. This is usually a small amount of data, but it must support exceptionally fast reads. These access patterns fit the performance characteristics of Redis or BigTable, where Chalk stores your online data.
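The caching behavior described above can be pictured with a small in-memory model. This is only a sketch of the idea of keeping a value for a chosen period, not how Chalk, Redis, or BigTable implement it; the class `TTLCache` and its parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Toy stand-in for an online store: a value is served only while it is fresh."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_seconds = max_age_seconds
        self._clock = clock  # injectable so the behavior is easy to test
        self._rows: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def put(self, entity_id: str, feature: str, value: Any) -> None:
        # Record the value together with the time it was written.
        self._rows[(entity_id, feature)] = (self._clock(), value)

    def get(self, entity_id: str, feature: str) -> Optional[Any]:
        row = self._rows.get((entity_id, feature))
        if row is None:
            return None
        written_at, value = row
        if self._clock() - written_at > self.max_age_seconds:
            # Too old: drop it so the caller knows to recompute the feature.
            del self._rows[(entity_id, feature)]
            return None
        return value
```

A real online store would also need eviction, persistence, and concurrency control, all of which this sketch leaves out.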

The offline environment stores far more data than the online environment. It keeps a record of all online runs and indexes all data brought in from your offline data sources. This time-series data is highly compressible, because many features change less frequently than they are sampled (for example, users on your platform might very rarely change their names). Chalk uses the column store Timescale or the data warehouse BigQuery for this data.
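To see why such data compresses well, consider keeping a sample only when the value changes. The sketch below illustrates that idea alone; it is not the compression scheme used by Timescale or BigQuery, and `keep_changes` is a hypothetical helper that assumes the samples are already sorted by timestamp.

```python
from typing import Any, List, Tuple

Sample = Tuple[int, Any]  # (timestamp, value)


def keep_changes(samples: List[Sample]) -> List[Sample]:
    """Keep a sample only if its value differs from the last kept value.

    Assumes `samples` is sorted by timestamp.
    """
    kept: List[Sample] = []
    for ts, value in samples:
        if not kept or kept[-1][1] != value:
            kept.append((ts, value))
    return kept


# A user's name sampled five times, changing once.
samples = [(1, "Ada"), (2, "Ada"), (3, "Ada"), (4, "Ada L."), (5, "Ada L.")]
compact = keep_changes(samples)
```

Here five samples reduce to two, one for each distinct run of values.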

In addition, offline queries write their output to a Parquet file in cloud storage (S3/GCS), whereas online queries write their results to a database.

Querying

Online queries are used to retrieve information about a single entity. For example, you might want to compute the features of a credit model for a single user, or decide which products to suggest to a customer. Online queries are therefore designed to be as fast as possible, returning within milliseconds. You can use our API client to pull this information.

Offline queries are used to sample historical data about many entities at specific points in time for model training or investigation. When you execute an offline query, Chalk will kick off a job that acquires the requested data for every primary key/timestamp combination presented. This could take a few seconds! Since offline queries often look up data for thousands of rows, they are not designed for millisecond-level decisions. See our guide on querying training data for a more in-depth treatment.

| | online query | offline query |
|---|---|---|
| online resolver | @online resolver will run | @online resolver will run if there is no @offline resolver with the same definition |
| offline resolver | @offline resolver will never run | @offline resolver will run |
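The table above can be read as a selection rule. The following sketch encodes that rule in plain Python; it does not use Chalk's API. The `Resolver` record, the `eligible_resolvers` function, and the idea of comparing a `definition` string to decide whether two resolvers have "the same definition" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Resolver:
    name: str
    kind: str        # "online" or "offline"
    definition: str  # hypothetical key describing inputs -> outputs


def eligible_resolvers(query_kind: str, resolvers: List[Resolver]) -> List[Resolver]:
    """Return the resolvers that may run for the given kind of query."""
    if query_kind == "online":
        # Online queries never run @offline resolvers.
        return [r for r in resolvers if r.kind == "online"]
    if query_kind != "offline":
        raise ValueError(f"unknown query kind: {query_kind!r}")
    # Offline queries run every @offline resolver, plus any @online resolver
    # that has no @offline counterpart with the same definition.
    offline_defs = {r.definition for r in resolvers if r.kind == "offline"}
    return [
        r for r in resolvers
        if r.kind == "offline" or r.definition not in offline_defs
    ]


resolvers = [
    Resolver("get_score_live", "online", "user.id -> user.score"),
    Resolver("get_score_warehouse", "offline", "user.id -> user.score"),
    Resolver("get_name_live", "online", "user.id -> user.name"),
]
```

With these three hypothetical resolvers, an online query may run `get_score_live` and `get_name_live`, while an offline query may run `get_score_warehouse` and `get_name_live`.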

Scheduling

Both offline and online resolvers support scheduled runs. However, you’re more likely to schedule offline resolvers. Every offline resolver pulls data from your underlying source into Chalk’s feature store on a schedule. Chalk will determine how frequently to poll your data sources, or you can provide a custom interval at which to pull the data.

Online resolvers can also be scheduled, although they don’t necessarily run on a schedule as offline resolvers do.

Both online and offline scheduled resolvers may take arguments, and you have control over the sets of arguments to run.


Interaction

Online-to-Offline

After an online resolver runs, its values are copied into the offline store. When you query the offline store, you will receive data from both records of online runs and offline-specific resolvers. Which value you receive depends on which observation is closest to the point in time that you queried. For more information, see temporal consistency.
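A point-in-time lookup over a mixed history can be sketched as follows. This is a toy model, not Chalk's implementation. It assumes that "closest" means the most recent observation at or before the queried time, so that no value from the future is returned; the names `Observation` and `value_as_of` are hypothetical.

```python
from bisect import bisect_right
from typing import Any, List, Optional, Tuple

Observation = Tuple[int, str, Any]  # (observed_at, source, value)


def value_as_of(observations: List[Observation], query_ts: int) -> Optional[Observation]:
    """Return the latest observation at or before query_ts, whatever its source."""
    ordered = sorted(observations, key=lambda o: o[0])
    timestamps = [o[0] for o in ordered]
    # Index of the first observation strictly after query_ts.
    idx = bisect_right(timestamps, query_ts)
    return ordered[idx - 1] if idx else None


# Offline-store history that mixes offline resolver output with a copied online run.
history = [
    (100, "offline resolver", 0.20),
    (250, "online run", 0.35),
    (400, "offline resolver", 0.50),
]
```

Querying this history at time 300 returns the online run recorded at 250, and querying at time 50 returns nothing because no observation exists yet.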

Offline-to-Online

In contrast, data from the offline environment does not reach the online store by default. However, you can choose to ETL the data from an offline resolver into the online store. This can be helpful, for example, when you can tolerate stale data in online inference and have a data source in the offline environment that has no direct replacement in the online environment. More details are provided in the Reverse ETL section.
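The opt-in nature of this flow can be shown with a small model: nothing is copied unless it is requested, and when it is, only the most recent offline value per entity is loaded. This is a sketch under those assumptions, not Chalk's reverse ETL mechanism; `load_latest_into_online` and its `enabled` flag are hypothetical.

```python
from typing import Any, Dict, List, Tuple

OfflineRow = Tuple[str, int, Any]  # (entity_id, observed_at, value)


def load_latest_into_online(offline_rows: List[OfflineRow],
                            online_store: Dict[str, Any],
                            enabled: bool = False) -> int:
    """Copy the newest offline value per entity into the online store.

    Returns the number of entities written. Does nothing unless enabled,
    mirroring the default where offline data stays offline.
    """
    if not enabled:
        return 0
    latest: Dict[str, Tuple[int, Any]] = {}
    for entity_id, observed_at, value in offline_rows:
        if entity_id not in latest or observed_at > latest[entity_id][0]:
            latest[entity_id] = (observed_at, value)
    for entity_id, (_, value) in latest.items():
        online_store[entity_id] = value
    return len(latest)


offline_rows = [("u1", 100, 0.2), ("u1", 300, 0.4), ("u2", 200, 0.9)]
online_store: Dict[str, Any] = {}
```

The values loaded this way are only as fresh as the last offline run, which is why the approach suits cases where stale data is acceptable.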

Summary

| Online query | Offline query |
|---|---|
| Runs only @online resolvers | Runs both @online and @offline resolvers |
| Returns one row of data about one entity | Returns a dataframe of many rows of historical data corresponding to multiple entities point-in-time |
| Designed to return data immediately in milliseconds | Blocks until computation is complete, not designed for millisecond-level computation |
| Queries the online store, which caches recent data from online queries for quick retrieval | Queries the offline store (Timescale), which stores all data from both online and offline queries |
| Writes output data to online store database and offline store database | Writes output to offline store database and a parquet file containing results to cloud storage. Only writes to online store if specified. |