Queries: Overview

To request or compute data with Chalk, you’ll use queries. In general, when you run a Chalk query, you are either requesting the most up-to-date values for your feature classes, requesting a set of historical data for your feature classes, or running a backfill or batch job.

The first use case is accomplished through online queries, which return values for a single instance of a feature class as quickly as possible, taking advantage of caching and distributed execution.

The latter two use cases are accomplished through offline queries, which use the offline store and can return multiple instances of a feature class for multiple primary keys or timepoints.

Running queries

At a high level, a query specifies input features and output features. Inputs differ slightly for online queries and offline queries, but in both cases the input must contain the primary key of your requested feature class.
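The primary-key requirement above can be checked before a query is sent. The following is a minimal sketch in plain Python, not part of the Chalk client: the helper name `missing_primary_keys` is hypothetical, and it assumes that feature names follow the `namespace.feature` form used on this page and that each feature class's primary key is named `id`.

```python
def missing_primary_keys(inputs, outputs, primary_key="id"):
    """Return the primary-key features that the outputs need but the inputs lack.

    Hypothetical helper: assumes "namespace.feature" names and a primary key
    called `id` on every feature class.
    """
    # "user.name" belongs to the "user" namespace.
    namespaces = {name.split(".", 1)[0] for name in outputs}
    # Each requested namespace needs its primary key among the inputs.
    required = {f"{ns}.{primary_key}" for ns in namespaces}
    return sorted(required - set(inputs))


print(missing_primary_keys({"user.id": 1}, ["user.name"]))            # []
print(missing_primary_keys({"user.email": "a@b.c"}, ["user.name"]))   # ['user.id']
```

An empty list means every requested feature class has its primary key among the inputs.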

Running online queries

Online queries can be run using the chalk query CLI command:

chalk query --in user.id=1 --out user.name

or one of our API Clients:

from chalk.client import ChalkClient

client = ChalkClient()
client.query(
  input={'user.id': 1},
  output=['user.name'],
  # branch='test', # run against a branch
)

Running offline queries

Offline queries can be run with one of our API Clients:

from chalk.client import ChalkClient
from datetime import datetime

client = ChalkClient()
client.offline_query(
  input={'user.id': [1, 2, 3, 4]},        # one list entry per primary key
  output=['user.name'],
  # branch='test',                      # run against a branch
  # recompute_features=True,            # recompute features
  # run_asynchronously=True,            # run in separate pod from active deployment
  # max_samples=10,                     # return at most 10 samples
  # lower_bound=datetime(2024, 10, 12), # only samples computed after 2024-10-12
  # upper_bound=datetime(2024, 10, 20), # only samples computed before 2024-10-20
)
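The input shapes differ between the two examples: the online query maps each input feature to a single value, while the offline query maps each input feature to a list of values. If you already have a collection of online-style inputs, a small conversion step produces the offline shape. This is a sketch in plain Python, and the helper name `to_offline_input` is hypothetical rather than part of the Chalk client.

```python
from collections import defaultdict


def to_offline_input(rows):
    """Turn a list of online-style inputs into one columnar offline-style input."""
    columns = defaultdict(list)
    expected = set(rows[0]) if rows else set()
    for row in rows:
        # Every row must supply the same features so the columns stay aligned.
        if set(row) != expected:
            raise ValueError(f"row {row!r} does not match features {sorted(expected)}")
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


rows = [{"user.id": 1}, {"user.id": 2}, {"user.id": 3}]
print(to_offline_input(rows))  # {'user.id': [1, 2, 3]}
```

The resulting dictionary has the same shape as the input argument passed to offline_query above.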

Running queries using gRPC

If you have a gRPC query server active in your environment, you can also run queries using the gRPC client:

from chalk.client.client_grpc import ChalkGRPCClient

grpc_client = ChalkGRPCClient()
grpc_client.query(
  input={'user.id': 1},
  output=['user.name']
)

Scheduled and triggered resolver runs

Individual resolvers can also be run on a schedule or triggered on demand (for instance, from an orchestration pipeline such as Airflow), and individual queries can be scheduled with ScheduledQuery. Triggers and schedules are useful for pulling data from “slow” data sources into your offline and online stores.

Query side effects

Chalk queries can also write data. This is an essential part of Chalk: every time you compute a feature through an online query, the output is written to the offline store. This makes it easy to:

  • create datasets from your previously computed features,
  • monitor and track your computed features over time.

Though not the default, offline queries can write data to both the offline store and the online store using etl_offline_to_online. This can be useful when backfilling data from slow data sources or when performing expensive feature computation that would otherwise significantly impact the latency of your online queries.
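The write path described in this section can be pictured with a toy in-memory model. This is only an illustration of the idea, not how Chalk is implemented: the class `ToyStores`, its methods, and the resolver function are all hypothetical. An online query stores the latest value in the online store and appends an observation to the offline store, and an offline query later reads those observations back as a dataset.

```python
class ToyStores:
    """Toy model of an online store (latest values) and an offline store (history)."""

    def __init__(self):
        self.online = {}    # (feature, primary key) -> latest value
        self.offline = []   # append-only log of (feature, primary key, value)

    def online_query(self, pk, feature, resolver):
        value = resolver(pk)
        self.online[(feature, pk)] = value          # serve later reads quickly
        self.offline.append((feature, pk, value))   # side effect: keep the history
        return value

    def offline_query(self, feature, pks):
        # Read previously computed observations instead of recomputing them.
        wanted = set(pks)
        return [row for row in self.offline if row[0] == feature and row[1] in wanted]


stores = ToyStores()
for user_id in (1, 2):
    stores.online_query(user_id, "user.name", lambda pk: f"user-{pk}")

print(stores.offline_query("user.name", [1, 2]))
# [('user.name', 1, 'user-1'), ('user.name', 2, 'user-2')]
```

Because every online computation leaves a row in the log, a dataset can be assembled later without calling the resolver again.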

Online and offline query differences

| Online query | Offline query |
|---|---|
| Runs only @online resolvers | Runs both @online and @offline resolvers |
| Returns one row of data about one entity | Returns a DataFrame of many rows of point-in-time historical data corresponding to multiple entities |
| Designed to return data in milliseconds | Blocks until computation is complete; not designed for millisecond-level computation |
| Queries the online store and calls @online resolvers for quick retrieval | Queries the offline store, which stores all data from online queries, unless recompute_features=True, in which case @offline and @online resolvers are used to resolve the outputs |
| Writes output data to the online store and the offline store | Writes results to a Parquet file in cloud storage. Only writes to the online store or offline store if specified. |