Queries
Fetch feature values via queries.
To request or compute data with Chalk, you’ll use queries. In general, when you run a Chalk query, you are either requesting the most up-to-date values for your feature classes, requesting a set of historical data for your feature classes, or running a backfill or batch job.
The first use case is accomplished through online queries, which try to return values for a single feature class as quickly as possible, taking advantage of caching and distributed execution.
The latter use cases are accomplished through offline queries, which use the offline store and can return multiple instances of a feature class across multiple primary keys or points in time.
At a high level, a query specifies input features and output features. Inputs differ slightly for online queries and offline queries, but in both cases the input must contain the primary key of your requested feature class.
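The examples below assume a minimal feature class along these lines; the User class and its fields are illustrative, not part of any real project:

```python
from chalk.features import features

@features
class User:
    id: int   # primary key, inferred from the name "id"; required as query input
    name: str
```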
Online queries can be run using the chalk query CLI command:
```bash
chalk query --in user.id=1 --out user.name
```
or with one of our API Clients:
```python
from chalk.client import ChalkClient

client = ChalkClient()
client.query(
    input={'user.id': 1},
    output=['user.name'],
    # branch='test',  # run against a branch
)
```
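Continuing the snippet above, the returned response object exposes the computed outputs. A minimal sketch, assuming the response's get_feature_value accessor:

```python
result = client.query(
    input={'user.id': 1},
    output=['user.name'],
)
# Read a single output feature off the response (accessor name assumed).
print(result.get_feature_value('user.name'))
```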
Offline queries can be run with one of our API Clients:
```python
from chalk.client import ChalkClient
from datetime import datetime

client = ChalkClient()
client.offline_query(
    input={'user.id': [1, 2, 3, 4]},
    output=['user.name'],
    # branch='test',  # run against a branch
    # recompute_features=True,  # recompute features instead of sampling the offline store
    # run_asynchronously=True,  # run in a separate pod from the active deployment
    # max_samples=10,  # return at most 10 samples
    # lower_bound=datetime(2024, 10, 12),  # sample rows computed after 2024-10-12
    # upper_bound=datetime(2024, 10, 20),  # sample rows computed before 2024-10-20
)
```
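Unlike query, offline_query returns a dataset object rather than a single row. A common pattern is to materialize the results as a DataFrame; this sketch assumes the dataset's get_data_as_pandas accessor:

```python
dataset = client.offline_query(
    input={'user.id': [1, 2, 3, 4]},
    output=['user.name'],
)
# Blocks until the offline query finishes, then loads the results.
df = dataset.get_data_as_pandas()
```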
If you have a gRPC query server active in your environment, you can also run queries using the gRPC client:
```python
from chalk.client.client_grpc import ChalkGRPCClient

grpc_client = ChalkGRPCClient()
grpc_client.query(
    input={'user.id': 1},
    output=['user.name'],
)
```
Specific resolvers can also be scheduled or triggered (for instance, as part of pipelines like Airflow). Specific queries can also be scheduled with ScheduledQuery, as sketched below. Triggers and schedules are useful for pulling data from "slow" data sources into your offline and online stores.
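As a sketch, a scheduled query can be declared in your Chalk project roughly as follows; the import path, cron schedule, and store flags here are assumptions, so check the ScheduledQuery reference for the exact parameters:

```python
from chalk import ScheduledQuery

# Recompute user.name daily and persist the results (parameter
# names are assumptions; see the ScheduledQuery reference).
ScheduledQuery(
    name="daily-user-names",
    schedule="0 0 * * *",  # cron: midnight every day
    output=["user.name"],
    store_online=True,   # write results to the online store
    store_offline=True,  # write results to the offline store
)
```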
Chalk queries can also write data. This is an essential part of Chalk: every time you compute a feature through an online query, the output is written to the offline store, which makes it easy to query historical feature values later with offline queries.
Though not the default, offline queries can write data to both the offline store and the online store using etl_offline_to_online. This can be useful when backfilling data from slow data sources, or when performing expensive feature computations that would otherwise significantly impact the latency of your online queries.
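etl_offline_to_online is set where features are defined. A minimal sketch, assuming the class-level form of the flag on the illustrative User class from above:

```python
from chalk.features import features

# With etl_offline_to_online=True, values computed offline for these
# features are also loaded into the online store.
@features(etl_offline_to_online=True)
class User:
    id: int
    name: str
```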
| Online query | Offline query |
| --- | --- |
| Runs only @online resolvers | Runs both @online and @offline resolvers |
| Returns one row of data about one entity | Returns a DataFrame of many rows of point-in-time historical data for multiple entities |
| Designed to return data in milliseconds | Blocks until computation is complete; not designed for millisecond-level latency |
| Queries the online store and calls @online resolvers for quick retrieval | Queries the offline store, which stores all data from online queries, unless recompute_features=True, in which case @offline and @online resolvers are used to resolve the outputs |
| Writes output data to the online store and the offline store | Writes results to a Parquet file in cloud storage; writes to the online or offline store only if specified |