Queries
Track and manage named queries in Chalk
With Chalk NamedQuery objects, you can define and version your common query patterns in code.
This provides several advantages:
Named queries typically map to specific models that you’re running in production. While feature classes model your domain objects (users, transactions, accounts) and may contain hundreds of features for reuse across different models, a named query selects only the specific subset of features needed for a particular model.
For example, you might have a User feature class with 100+ features capturing everything about a user: their profile, behavior metrics, transaction history aggregations, and risk signals. However, your fraud detection model might only need 15 specific features, while your recommendation model needs a different set of 30 features. Named queries let you define these model-specific feature sets, ensuring each model gets exactly what it needs without unnecessary computation.
This separation allows different models to access the same domain objects through feature classes while only requesting the features they need, improving performance and making it easier to track which features each model depends on.
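To make the dependency-tracking benefit concrete, here is a minimal, library-free sketch. It models each named query as a plain mapping from a query name to its output features, then inverts that mapping to show which queries depend on each feature. The `recommendation` query, the `user.favorite_category` feature, and the `models_by_feature` helper are hypothetical names chosen for illustration; none of this is Chalk's API.

```python
from collections import defaultdict

# Plain-Python stand-in for named queries: query name -> output features.
# "recommendation" and "user.favorite_category" are hypothetical.
NAMED_QUERY_OUTPUTS = {
    "fraud": ["user.email_age_days", "user.denylisted", "user.credit_report.flags"],
    "recommendation": ["user.email_age_days", "user.favorite_category"],
}


def models_by_feature(queries):
    """Invert query -> features into feature -> sorted list of query names."""
    index = defaultdict(list)
    for query_name, features in queries.items():
        for feature in features:
            index[feature].append(query_name)
    return {feature: sorted(names) for feature, names in index.items()}


dependents = models_by_feature(NAMED_QUERY_OUTPUTS)
print(dependents["user.email_age_days"])  # read by both queries
print(dependents["user.denylisted"])      # read only by the fraud query
```

Because each model's feature set is declared in one place, a lookup like this is enough to answer "what breaks if this feature changes?"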
To define a named query, add a NamedQuery object to your Chalk deployment:
from chalk import NamedQuery
from src.models import User

NamedQuery(
    name="fraud",
    input=[User.id],
    output=[
        User.email_age_days,
        User.denylisted,
        User.credit_report.flags,
    ],
    tags=["team:fraud"],
    owner="jodie@chalk.ai",
    description="Primary fraud model for signup",
)
Running chalk apply makes the named query available in your deployment. Named queries can then be used through any of our clients by specifying the query_name parameter.
Using the Chalk CLI tool, this looks something like:
chalk query --in user.id=1 --query-name fraud
Because a named query has been specified, you don’t need to explicitly pass in the tags and outputs for your query. The above command is equivalent to the more verbose:
chalk query \
    --in user.id=1 \
    --out user.email_age_days \
    --out user.denylisted \
    --out user.credit_report.flags \
    --tag team:fraud
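The equivalence above can be sketched in plain Python. The snippet below is an illustration only: `NamedQuerySpec` and `expand_to_cli` are hypothetical helpers, not part of Chalk. It shows how the stored outputs and tags of the `fraud` query expand into the explicit `--out` and `--tag` flags.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class NamedQuerySpec:
    """Hypothetical stand-in mirroring the fields of the `fraud` example."""
    name: str
    outputs: list
    tags: list = field(default_factory=list)


def expand_to_cli(spec, inputs):
    """Build the explicit `chalk query` command that the named query replaces."""
    args = ["chalk", "query"]
    for key, value in inputs.items():
        args += ["--in", f"{key}={value}"]
    for output in spec.outputs:
        args += ["--out", output]
    for tag in spec.tags:
        args += ["--tag", tag]
    return shlex.join(args)


fraud = NamedQuerySpec(
    name="fraud",
    outputs=["user.email_age_days", "user.denylisted", "user.credit_report.flags"],
    tags=["team:fraud"],
)
command = expand_to_cli(fraud, {"user.id": 1})
print(command)
```

The printed command is the long form shown above, written on a single line.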
Named queries are also accessible in all of our API clients through the query_name parameter. For instance, in Python, you can run:
from chalk.client import ChalkClient

ChalkClient().query(
    input={"user.id": 1},
    query_name="fraud",
)
You can also run a named query offline, provided that all outputs have offline resolvers:
from chalk.client import ChalkClient

# Keep a reference to the returned dataset so its rows can be read below.
dataset = ChalkClient().offline_query(
    input={"user.id": 1},
    query_name="fraud",
    recompute_features=True,
)
df = dataset.get_data_as_pandas()
To see all the named queries you’ve defined in your current active deployment, you can run:
chalk named-query list
<example output>
If you want to create multiple versions of a similar query, you can use the version parameter of the NamedQuery object and the query_name_version parameter of our various clients.
Note that when executing a named query, both the query name and the query version must match. This means that if you’ve defined two named queries in your codebase:
from chalk import NamedQuery
from src.models import User

NamedQuery(
    name="fraud",
    input=[User.id],
    output=[User.denylisted],
)

NamedQuery(
    name="fraud",
    version="1.1.0",
    input=[User.id],
    output=[
        User.email_age_days,
        User.denylisted,
        User.credit_report.flags,
    ],
)
And you run the following query:
chalk query --in user.id=1 --query-name fraud
We will return User.denylisted, since the first named query has no version and no version was passed through query-name-version. To access a versioned named query, the version must be explicitly passed. For example:
chalk query --in user.id=1 --query-name fraud --query-name-version 1.1.0
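The matching rule can be pictured as an exact lookup on the pair (name, version). The sketch below is a plain-Python illustration under that assumption, not Chalk's implementation: `REGISTRY` and `resolve_outputs` are hypothetical, and raising `KeyError` when nothing matches is an assumption of this sketch, since the behavior for a missing match is not described here.

```python
# Registry keyed by (name, version); None stands for "no version".
REGISTRY = {
    ("fraud", None): ["user.denylisted"],
    ("fraud", "1.1.0"): [
        "user.email_age_days",
        "user.denylisted",
        "user.credit_report.flags",
    ],
}


def resolve_outputs(query_name, query_name_version=None):
    """Return the outputs of the query whose name AND version both match."""
    key = (query_name, query_name_version)
    if key not in REGISTRY:
        # Assumption: an unmatched (name, version) pair is treated as an error.
        raise KeyError(f"no named query for name={query_name!r}, version={query_name_version!r}")
    return REGISTRY[key]


print(resolve_outputs("fraud"))           # the unversioned definition
print(resolve_outputs("fraud", "1.1.0"))  # the versioned definition
```

Note that omitting the version does not fall back to the newest version; it selects only the definition that has no version.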
Sometimes defining NamedQuery objects is not ergonomic or possible. For example, if you are a platform team serving multiple teams, you may not want to define a NamedQuery object for every query that your users run.
In this case, you can use these environment variables:
CHALK_STORE_ADHOC_QUERIES=true
CHALK_PLAN_ADHOC_QUERIES=3
The first environment variable caches ad-hoc query requests in the database. The second environment variable plans up to 3 of the most recent ad-hoc queries. These ad-hoc queries are re-planned at boot so that code or platform changes are reflected in the query plan.
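As a rough mental model of how the two settings interact, the sketch below uses an in-memory store in place of the database. Everything in it is an assumption made for illustration: the `AdhocQueryStore` class, identifying a query by its sorted inputs and outputs, and treating a repeated request as the most recent one. Chalk's actual storage and selection logic may differ.

```python
from collections import OrderedDict


class AdhocQueryStore:
    """In-memory stand-in for the database that stores ad-hoc query requests."""

    def __init__(self):
        self._seen = OrderedDict()  # query signature -> None, oldest first

    def record(self, inputs, outputs):
        """Remember an ad-hoc query; a repeat moves it to the most recent slot."""
        signature = (tuple(sorted(inputs)), tuple(sorted(outputs)))
        self._seen[signature] = None
        self._seen.move_to_end(signature)

    def to_plan_at_boot(self, limit):
        """Return up to `limit` query signatures, most recent first."""
        return list(reversed(self._seen))[:limit]


store = AdhocQueryStore()
store.record(["user.id"], ["user.denylisted"])
store.record(["user.id"], ["user.email_age_days"])
store.record(["user.id"], ["user.credit_report.flags"])
store.record(["user.id"], ["user.email_age_days", "user.denylisted"])
store.record(["user.id"], ["user.denylisted"])  # repeated request

# With a limit of 3, the oldest distinct query is not planned at boot.
planned = store.to_plan_at_boot(limit=3)
for signature in planned:
    print(signature)
```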