Resolvers
Predict and train.
Chalk supports online and offline resolvers. Online resolvers produce features for online inference, and offline resolvers produce offline training data.
Online and offline resolvers have different read/write needs. Your online environment stores the data you want to cache, for as long as you want to cache it. This is usually a small amount of data, but reads need to be exceptionally fast. These access patterns fit the performance characteristics of Redis or BigTable, where Chalk stores your online data.
The offline environment stores far more data than the online environment. It keeps a record of all online runs and indexes all data brought in from your offline data sources. This timeseries data is highly compressible, as many features change less frequently than they are sampled (for example, users on your platform might very rarely change their names). Chalk uses the column store Timescale or the data warehouse BigQuery for this data.
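To see why slowly changing features compress well, consider a minimal sketch that keeps a sample only when the value differs from the previous one. This is an illustration of the idea, not a description of how Timescale or BigQuery compress data internally; the function name `compress_samples` and the sample data are hypothetical.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def compress_samples(samples: List[Tuple[int, T]]) -> List[Tuple[int, T]]:
    """Keep only the samples whose value differs from the previously kept value.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    compressed: List[Tuple[int, T]] = []
    for timestamp, value in samples:
        # Store a sample only when the feature value actually changed.
        if not compressed or compressed[-1][1] != value:
            compressed.append((timestamp, value))
    return compressed


# Hypothetical data: a user's name sampled five times, changed once.
name_samples = [(1, "Ada"), (2, "Ada"), (3, "Ada"), (4, "Ada L."), (5, "Ada L.")]
changes_only = compress_samples(name_samples)
```

Here five samples reduce to two stored changes, and the value at any sampled timestamp can still be recovered by looking up the most recent change at or before it.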
In addition, offline queries write their output to a Parquet file in cloud storage (S3/GCS), whereas online queries write their results to a database.
Online queries are used to retrieve information about a single entity. For example, you might be looking to compute the features of a credit model for a single user, or decide what products to suggest to a customer. Thus, online queries are designed to be as quick as possible, returning within milliseconds. You can use our API client to pull this information.
Offline queries are used to sample historical data about many entities at specific points in time for model training or investigation. When you execute an offline query, Chalk will kick off a job that acquires the requested data for every primary key/timestamp combination presented. This could take a few seconds! Since offline queries often look up data for thousands of rows, they are not designed to be used to make millisecond-level decisions. See our guide on querying training data for a more in-depth treatment.
Resolver | online query | offline query |
---|---|---|
online resolver | @online resolver will run | @online resolver will run if there is no @offline resolver with the same definition |
offline resolver | @offline resolver will never run | @offline resolver will run |
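The selection rule in the table above can be expressed as a small function. This is a sketch of the rule as stated in the table, not Chalk's implementation; the function name `resolvers_that_run` and its parameters are hypothetical.

```python
from typing import List


def resolvers_that_run(query_kind: str, has_online: bool, has_offline: bool) -> List[str]:
    """Return which resolver kind runs for a feature, following the table above.

    `has_online` / `has_offline` say whether a resolver of that kind exists
    with the same definition (same inputs and outputs).
    """
    if query_kind not in ("online", "offline"):
        raise ValueError(f"unknown query kind: {query_kind!r}")

    if query_kind == "online":
        # Online queries never run @offline resolvers.
        return ["online"] if has_online else []

    # Offline queries prefer the @offline resolver and fall back to @online.
    if has_offline:
        return ["offline"]
    return ["online"] if has_online else []
```

For example, `resolvers_that_run("offline", has_online=True, has_offline=False)` returns `["online"]`, matching the fallback described in the table.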
Both offline and online resolvers support scheduled runs. However, you are more likely to schedule offline resolvers. Every offline resolver pulls in data from your underlying source to Chalk’s feature store on a schedule. Chalk will determine the frequency at which to poll your data sources, or you can provide a custom interval at which to pull the data.
Online resolvers can also be scheduled, but unlike offline resolvers, they do not have to run on a schedule.
Both online and offline scheduled resolvers may take arguments, and you have control over the sets of arguments to run.
After an online resolver runs, its values are copied into the offline store. When you query the offline store, you will receive data from both records of online runs and offline-specific resolvers. Which data you receive depends on which data was closest to the point-in-time that you queried. For more information, see temporal consistency.
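A point-in-time lookup over the merged records might look like the sketch below. It assumes that "closest" means the most recent observation at or before the queried timestamp, which is an interpretation rather than something stated here; the names `Observation` and `value_as_of` and the example data are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Any, List, Optional, Tuple

# (observed_at, source, value), where source is "online" or "offline".
Observation = Tuple[datetime, str, Any]


def value_as_of(observations: List[Observation], query_time: datetime) -> Optional[Observation]:
    """Return the latest observation at or before `query_time`, from either source."""
    ordered = sorted(observations, key=lambda obs: obs[0])
    timestamps = [obs[0] for obs in ordered]
    # Index of the first observation strictly after query_time.
    idx = bisect_right(timestamps, query_time)
    if idx == 0:
        return None  # nothing was known yet at query_time
    return ordered[idx - 1]


# Hypothetical history for one feature of one entity.
history: List[Observation] = [
    (datetime(2023, 1, 5), "online", 700),    # recorded from an online run
    (datetime(2023, 1, 10), "offline", 720),  # produced by an offline resolver
]

on_jan_7 = value_as_of(history, datetime(2023, 1, 7))
on_jan_12 = value_as_of(history, datetime(2023, 1, 12))
```

In this example the lookup for January 7 returns the value recorded from the online run, while the lookup for January 12 returns the value from the offline resolver.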
In contrast, data from the offline environment does not reach the online store by default. However, you can choose to ETL the data from an offline resolver into the online store. This can be helpful, for example, when you can tolerate stale data in online inference and have a data source in the offline environment that doesn’t have a direct replacement in the online environment. More details are provided in the section Reverse ETL.
Online query | Offline query |
---|---|
Runs only @online resolvers | Runs both @online and @offline resolvers |
Returns one row of data about one entity | Returns a dataframe with many rows of point-in-time historical data for multiple entities |
Designed to return data immediately in milliseconds | Blocks until computation is complete, not designed for millisecond-level computation |
Queries the online store, which caches recent data from online queries for quick retrieval | Queries the offline store (Timescale or BigQuery), which stores all data from both online and offline queries |
Writes output data to the online store database and the offline store database | Writes output to the offline store database and writes a Parquet file containing the results to cloud storage. Only writes to the online store if specified. |