Queries
Track and manage named queries in Chalk
With Chalk NamedQuery objects, you can define and version
your common query patterns in code.
This provides several advantages:
Named queries typically map to specific models that you’re running in production. While feature classes model your domain objects (users, transactions, accounts) and may contain hundreds of features for reuse across different models, a named query selects only the specific subset of features needed for a particular model.
For example, you might have a User feature class with 100+ features capturing everything about a user: their
profile, behavior metrics, transaction history aggregations, and risk signals. However, your fraud detection model
might only need 15 specific features, while your recommendation model needs a different set of 30 features.
Named queries let you define these model-specific feature sets, ensuring each model gets exactly what it needs
without unnecessary computation.
This separation allows different models to access the same domain objects through feature classes while only requesting the features they need, improving performance and making it easier to track which features each model depends on.
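As an illustration of tracking which features each model depends on, the sketch below inverts a model-to-features mapping into a feature-to-models index. It is plain Python rather than Chalk API code, and the recommendations feature set and the user.favorite_categories name are hypothetical.

```python
from collections import defaultdict

# Hypothetical feature sets per model; only the "fraud" outputs come from this page.
MODEL_FEATURES = {
    "fraud": [
        "user.email_age_days",
        "user.denylisted",
        "user.credit_report.flags",
    ],
    "recommendations": [
        "user.email_age_days",
        "user.favorite_categories",  # illustrative name, not a real feature
    ],
}


def models_by_feature(model_features):
    """Invert model -> features into feature -> sorted list of models."""
    index = defaultdict(list)
    for model, features in model_features.items():
        for feature in features:
            index[feature].append(model)
    return {feature: sorted(models) for feature, models in index.items()}


FEATURE_INDEX = models_by_feature(MODEL_FEATURES)
```

Looking up a feature in FEATURE_INDEX then shows which models would be affected by a change to it.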
To define a named query, add a NamedQuery object to your Chalk deployment:
from chalk import NamedQuery
from src.models import User

NamedQuery(
    name="fraud",
    input=[User.id],
    output=[
        User.email_age_days,
        User.denylisted,
        User.credit_report.flags,
    ],
    tags=["team:fraud"],
    owner="jodie@chalk.ai",
    description="Primary fraud model for signup",
)
Running chalk apply makes the named query available in your deployment.
Named queries can then be leveraged through any of our clients by specifying the query_name parameter.
Using the Chalk CLI tool, this looks something like:
chalk query --in user.id=1 --query-name fraud
Because a named query has been specified, you don’t need to explicitly pass in the tags and outputs for your query. The above command is equivalent to running the longer form:
chalk query \
    --in user.id=1 \
    --out user.email_age_days \
    --out user.denylisted \
    --out user.credit_report.flags \
    --tag team:fraud
This feature is also accessible in all of our API clients through the query_name parameter.
For instance, in Python, you can run:
from chalk.client import ChalkClient
ChalkClient().query(
    input={"user.id": 1},
    query_name="fraud",
)
You can also run a named query offline, provided that all outputs have offline resolvers.
from chalk.client import ChalkClient

# Keep the returned dataset so its results can be loaded below.
dataset = ChalkClient().offline_query(
    input={"user.id": 1},
    query_name="fraud",
    recompute_features=True,
)
df = dataset.get_data_as_pandas()
To see all the named queries you’ve defined in your current active deployment, you can run:
$ chalk named-query list
<example output>
If you want to create multiple versions of a similar query, you can use the version parameter of the NamedQuery object and the query_name_version parameter of our various clients.
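One way to picture how a name and version select a query is a lookup keyed on both values, where a query defined without a version is stored under a version of None. The sketch below is a mental model in plain Python, not Chalk's implementation; the REGISTRY dictionary, the resolve_outputs helper, and the choice to raise KeyError on a miss are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical registry: (query name, version) -> output feature names.
REGISTRY: Dict[Tuple[str, Optional[str]], List[str]] = {
    ("fraud", None): ["user.denylisted"],
    ("fraud", "1.1.0"): [
        "user.email_age_days",
        "user.denylisted",
        "user.credit_report.flags",
    ],
}


def resolve_outputs(
    registry: Dict[Tuple[str, Optional[str]], List[str]],
    query_name: str,
    query_name_version: Optional[str] = None,
) -> List[str]:
    """Return the outputs of the query whose name AND version both match."""
    key = (query_name, query_name_version)
    if key not in registry:
        # Assumption of this sketch: an unmatched pair is treated as an error.
        raise KeyError(f"no named query {query_name!r} with version {query_name_version!r}")
    return registry[key]


UNVERSIONED = resolve_outputs(REGISTRY, "fraud")
VERSIONED = resolve_outputs(REGISTRY, "fraud", "1.1.0")
```

In this model, omitting the version never falls through to a versioned query; only the entry stored without a version matches.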
Note that when executing a named query, both the query name and the query version must match. This means that if you’ve defined two named queries in your codebase:
from chalk import NamedQuery
from src.models import User
NamedQuery(
    name="fraud",
    input=[User.id],
    output=[User.denylisted],
)

NamedQuery(
    name="fraud",
    version="1.1.0",
    input=[User.id],
    output=[
        User.email_age_days,
        User.denylisted,
        User.credit_report.flags,
    ],
)
and you run the following query:
chalk query --in user.id=1 --query-name fraud
We will return User.denylisted, since the first named query has no version and no version was passed through --query-name-version. To access a versioned named query, the version must be explicitly passed. For example:
chalk query --in user.id=1 --query-name fraud --query-name-version 1.1.0
Defining NamedQuery objects is the recommended way to ensure that your queries are pre-planned on start-up, so that planning time does not add to your query latency. Pre-planning depends on the environment variable CHALK_PRE_PLAN_NAMED_QUERIES=1, which should be set by default. However, defining NamedQuery objects is sometimes not ergonomic or possible. For example, if you are a platform team serving multiple teams, you may not want to define a NamedQuery object for every query that your users run.
In this case, you can cache ad-hoc query plans by setting the following environment variables:
CHALK_STORE_ADHOC_QUERIES=true
CHALK_PLAN_ADHOC_QUERIES=3
The first environment variable enables writing query requests to the database. Setting the second environment variable to 3 makes the engine pod plan up to 3 of the most recently saved ad-hoc queries. These ad-hoc queries are re-planned at boot, so code or platform changes will be reflected in the query plan. With ad-hoc query caching enabled, you can cache the plans of your most frequent queries without defining the queries in code.
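The boot-time selection described above can be pictured with a short simulation. This is a sketch under stated assumptions, not the engine's code: the stored-request shape, the saved_at timestamps, and the select_requests_to_plan helper are hypothetical, and the limit argument stands in for the value of CHALK_PLAN_ADHOC_QUERIES.

```python
from typing import Dict, List

# Hypothetical stored ad-hoc requests; saved_at is an illustrative timestamp in seconds.
STORED_REQUESTS: List[Dict] = [
    {"outputs": ("user.denylisted",), "saved_at": 100},
    {"outputs": ("user.email_age_days",), "saved_at": 400},
    {"outputs": ("user.credit_report.flags",), "saved_at": 300},
    {"outputs": ("user.denylisted", "user.email_age_days"), "saved_at": 200},
]


def select_requests_to_plan(stored: List[Dict], limit: int) -> List[Dict]:
    """Return up to `limit` of the most recently saved requests, newest first."""
    newest_first = sorted(stored, key=lambda r: r["saved_at"], reverse=True)
    return newest_first[: max(limit, 0)]


# Mirrors a limit of 3: the oldest of the four requests is not planned at boot.
TO_PLAN = select_requests_to_plan(STORED_REQUESTS, limit=3)
```

Requests that fall outside the limit are still stored in this model; they are simply planned on first use rather than at boot.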
The downside of caching and pre-planning ad-hoc queries is that pre-planning a large number of them during boot can take a long time. To help alleviate this, the Durable Plan Cache can be used. Query plans (as opposed to query requests) can be serialized and written to the Durable Plan Cache. These plans persist across pods and can be pre-loaded (as opposed to planned) into the in-memory plan cache on the next pod startup. Since pre-loading is faster than planning, pod startups take less time.
You can configure the Durable Plan Cache with the following environment variables:
CHALK_PERSIST_DURABLE_PLAN_CACHE=true
CHALK_PREPOPULATE_DURABLE_PLAN_CACHE=10
CHALK_PREPOPULATE_DURABLE_PLAN_CACHE_DURATION=259200
CHALK_PERSIST_DURABLE_PLAN_CACHE enables writing query plans to the Durable Plan Cache. You can then choose how new pods load query plans using either CHALK_PREPOPULATE_DURABLE_PLAN_CACHE=k, which loads the k most recently written query plans, or CHALK_PREPOPULATE_DURABLE_PLAN_CACHE_DURATION={DURATION_IN_SECONDS}, which loads all query plans written within the specified number of seconds before now. In the example above, 259200 seconds is 3 days.
Query plans written by a resource group in a deployment will only be valid for that resource group in the same deployment.
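The two pre-population rules above can be compared with a small simulation. This is a minimal sketch, assuming each persisted plan carries a write timestamp; the plan records, the newest_k and written_within helpers, and the NOW value are hypothetical, and how the engine behaves when both variables are set is not covered here.

```python
from typing import Dict, List

NOW = 1_000_000  # illustrative "current time" in seconds

# Hypothetical persisted plans with their write times.
PLANS: List[Dict[str, int]] = [
    {"plan_id": 1, "written_at": NOW - 60},       # one minute ago
    {"plan_id": 2, "written_at": NOW - 86_400},   # one day ago
    {"plan_id": 3, "written_at": NOW - 400_000},  # more than four days ago
]


def newest_k(plans: List[Dict[str, int]], k: int) -> List[Dict[str, int]]:
    """Count-based rule: the k most recently written plans, newest first."""
    return sorted(plans, key=lambda p: p["written_at"], reverse=True)[:k]


def written_within(
    plans: List[Dict[str, int]], now: int, duration_seconds: int
) -> List[Dict[str, int]]:
    """Duration-based rule: every plan written in the last duration_seconds."""
    cutoff = now - duration_seconds
    return [p for p in plans if p["written_at"] >= cutoff]


NEWEST_ONE = newest_k(PLANS, 1)
LAST_THREE_DAYS = written_within(PLANS, NOW, 259_200)
```

The count-based rule bounds how many plans are loaded regardless of age, while the duration-based rule bounds age regardless of count, so the better choice depends on whether startup time or plan freshness matters more.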