Online / Offline

Chalk is composed of an online and an offline store. The online store is a cache used to make realtime requests for data extremely fast, while the offline store is a data warehouse that stores all the features that you’ve computed—enabling monitoring and dataset generation for training.

Storage

Your online environment stores data that you want to cache according to the cache policies you set on your features or on your queries. This is usually a small amount of data, but it needs exceptionally fast reads. That access pattern matches the performance characteristics of Redis and DynamoDB, which Chalk uses to store your online data.
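To make the idea of a cache policy concrete, here is a minimal toy model of a staleness-bounded cache in plain Python. It is not Chalk's implementation or API: `ToyOnlineStore`, `get_or_compute`, and the `max_staleness` argument expressed in seconds are hypothetical names chosen for illustration, and the in-memory dictionary only stands in for a store such as Redis or DynamoDB.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ToyOnlineStore:
    """Hypothetical in-memory stand-in for an online store (not Chalk's API)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # Maps (entity_id, feature_name) -> (value, time the value was written).
        self._data: Dict[Tuple[str, str], Tuple[Any, float]] = {}

    def get_or_compute(
        self,
        entity_id: str,
        feature: str,
        max_staleness: float,
        compute: Callable[[], Any],
    ) -> Tuple[Any, str]:
        """Return (value, source), where source is "cache" or "computed"."""
        now = self._clock()
        cached = self._data.get((entity_id, feature))
        # Assumption: max_staleness == 0 means "never serve from cache".
        if cached is not None and max_staleness > 0:
            value, written_at = cached
            if now - written_at <= max_staleness:
                return value, "cache"
        value = compute()
        # Only features that tolerate staleness are worth caching.
        if max_staleness > 0:
            self._data[(entity_id, feature)] = (value, now)
        return value, "computed"
```

In this sketch, a second read inside the staleness window is served from the dictionary, while a read after the window (or with `max_staleness=0`) calls `compute` again.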

The offline environment stores far more data than the online environment. It keeps a record of all online runs and indexes all data brought in from your offline data sources. Chalk integrates with a number of different data warehouse systems for our large-scale offline storage depending on customer needs and deployment type, including BigQuery, Snowflake, and Redshift.

In addition, offline queries write their output to a Parquet file in cloud storage (S3/GCS), whereas online queries write their results to a database.

| Writes to the offline store | Writes to the online store |
| --- | --- |
| An online query writes all freshly computed features (those not read from the online store) | An online query writes all freshly computed features with max_staleness != 0 |
| A triggered resolver run with store_offline=True (default behavior) | A triggered resolver run with store_online=True (default behavior) |
| Scheduled queries with recompute_features=True and store_offline=True (default behavior) | Scheduled query with store_online=True (default behavior) |
| Ingesting a dataset to the offline store: dataset.ingest(store_offline=True) | Ingesting a dataset to the online store: dataset.ingest(store_online=True) |
| An @online or @offline scheduled resolver | An @online or @offline scheduled resolver that computes features with max_staleness != 0 and etl_offline_to_online=True |
| A streaming resolver: @stream | A streaming resolver: @stream |
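The online-query row of this table can be sketched as a small decision function. This is a toy model under stated assumptions, not Chalk code: `FeatureResult` and `online_query_writes` are hypothetical names, and `max_staleness` is represented as an integer number of seconds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FeatureResult:
    """Hypothetical record of one feature returned by an online query."""
    name: str
    max_staleness: int            # seconds; 0 means the feature is never cached
    read_from_online_store: bool  # True if served from cache, not freshly computed


def online_query_writes(results: List[FeatureResult]) -> Dict[str, List[str]]:
    """Decide which store each feature of an online query is written to."""
    # Only freshly computed features are written anywhere.
    fresh = [r for r in results if not r.read_from_online_store]
    return {
        # The offline store records every freshly computed feature.
        "offline_store": [r.name for r in fresh],
        # The online store only caches features with a nonzero max_staleness.
        "online_store": [r.name for r in fresh if r.max_staleness != 0],
    }
```

For example, a feature that was freshly computed but has `max_staleness=0` would appear under `"offline_store"` only, and a feature that was read from the cache would appear under neither key.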

Querying

Every request you make to Chalk for data is done through a query and every query you make is either an online or an offline query.

Online queries are used to receive information about a single entity. For example, you might be looking to compute the features of a credit model for a single user, or decide what products to suggest to a customer. Online queries are therefore designed to be as quick as possible, returning within milliseconds. You can use our API client to run queries.

Offline queries are used to sample historical data about many entities at specific points in time for model training or investigation. When you execute an offline query, Chalk will kick off a job that acquires the requested data for every primary key/timestamp combination presented. This could take a few seconds! Since offline queries often look up data for thousands of rows, they are not designed to be used to make millisecond-level decisions. See our guide on querying training data for a more in-depth treatment.

| | Online query | Offline query |
| --- | --- | --- |
| Online resolver | @online resolver will run | @online resolver will run if there is no @offline resolver with the same definition |
| Offline resolver | @offline resolver will never run | @offline resolver will run |
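The resolver-selection rules in this table can be expressed as a short function. This is an illustrative sketch only: `Resolver` and `resolvers_that_run` are hypothetical names, and "the same definition" is assumed here to mean the same input and output features, which may not match Chalk's exact rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Resolver:
    """Hypothetical description of a resolver (not Chalk's own type)."""
    name: str
    kind: str                 # "online" or "offline"
    inputs: Tuple[str, ...]   # features the resolver depends on
    outputs: Tuple[str, ...]  # features the resolver produces


def resolvers_that_run(resolvers: List[Resolver], query_type: str) -> List[str]:
    """Return the names of resolvers eligible to run for a query type."""
    if query_type == "online":
        # Online queries never run @offline resolvers.
        return [r.name for r in resolvers if r.kind == "online"]
    if query_type == "offline":
        # Assumption: two resolvers share a definition when their
        # inputs and outputs are identical.
        offline_defs = {
            (r.inputs, r.outputs) for r in resolvers if r.kind == "offline"
        }
        return [
            r.name
            for r in resolvers
            if r.kind == "offline" or (r.inputs, r.outputs) not in offline_defs
        ]
    raise ValueError(f"unknown query type: {query_type!r}")
```

With one online and one offline resolver that both map `user.id` to `user.score`, an offline query in this model picks the offline one, while an online resolver with no offline counterpart still runs.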

Online/Offline Interaction

Online-to-Offline

After an online resolver runs, its values are copied into the offline store. When you query the offline store, you receive data from both records of online runs and offline-specific resolvers. Which value you receive depends on which observation is closest to the point in time that you queried. For more information, see temporal consistency.
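A point-in-time lookup of this kind can be sketched as follows. The sketch assumes that "closest" means the most recent observation at or before the queried time, so that no value from the future leaks into the result; that reading, and the `value_as_of` helper itself, are assumptions for illustration rather than Chalk's documented behavior.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Any, List, Optional, Tuple


def value_as_of(
    observations: List[Tuple[datetime, Any]],
    query_time: datetime,
) -> Optional[Any]:
    """Return the latest observed value at or before query_time, if any.

    `observations` may mix values recorded from online runs and values
    produced by offline resolvers; only their timestamps matter here.
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    timestamps = [ts for ts, _ in ordered]
    # Index of the first observation strictly after query_time.
    idx = bisect_right(timestamps, query_time)
    if idx == 0:
        return None  # nothing was known yet at query_time
    return ordered[idx - 1][1]
```

An offline query over many primary key/timestamp pairs would, in this model, apply the same lookup once per pair.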

Offline-to-Online

In contrast, data from the offline environment does not reach the online store by default. However, you can choose to ETL the data from an offline resolver into the online store. This can be helpful, for example, when you tolerate stale data in online inference and have a data source in the offline environment that doesn’t have a direct replacement in the online environment. More details are provided in the section Reverse ETL.
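As a rough illustration of the opt-in nature of this path, the following toy function filters which features from an offline resolver would be loaded into the online store. It assumes per-feature settings held in a plain dictionary with `max_staleness` (seconds) and `etl_offline_to_online` keys; the dictionary layout and the `features_to_load_online` helper are hypothetical, not Chalk's configuration format.

```python
from typing import Any, Dict, List


def features_to_load_online(settings: Dict[str, Dict[str, Any]]) -> List[str]:
    """Pick features whose offline values should be copied to the online store."""
    selected = []
    for name, opts in settings.items():
        # Offline data stays offline unless the feature opts in...
        opted_in = opts.get("etl_offline_to_online", False)
        # ...and caching only makes sense when staleness is tolerated.
        cacheable = opts.get("max_staleness", 0) != 0
        if opted_in and cacheable:
            selected.append(name)
    return selected
```

In this model, a feature with neither setting is left out, which mirrors the default that offline data does not reach the online store.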

Summary

| Online query | Offline query |
| --- | --- |
| Runs only @online resolvers | Runs both @online and @offline resolvers |
| Returns one row of data about one entity | Returns a DataFrame of many rows of historical data corresponding to multiple entities point-in-time |
| Designed to return data in milliseconds | Blocks until computation is complete, not designed for millisecond-level computation |
| Queries the online store and calls @online resolvers for quick retrieval | Queries the offline store which stores all data from online queries, unless recompute_features=True in which case @offline and @online resolvers are used to resolve the outputs |
| Writes output data to online store database and offline store database | Writes output to a parquet file containing results to cloud storage. Only writes to online store or offline store if specified. |