Queries
Fetch offline feature values.
Offline queries pull data from the offline store or calculate features through resolvers that are marked as offline.
Offline queries can also execute online resolvers if no offline resolver is available for a requested feature.
In an offline query you can request features for multiple entities at distinct time points. By default, an offline query returns a row for each primary key containing the most recent computed value for each requested output feature. The main use case for offline queries is creating datasets.
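To picture the default behavior described above, here is a minimal sketch in plain Python. It is a mental model only, not Chalk's implementation: it assumes a hypothetical list of (primary key, observation time, value) records, and `latest_per_key` is an illustrative name.

```python
from datetime import datetime


def latest_per_key(observations):
    """Return {primary_key: value} using the most recent observation per key."""
    latest = {}
    for key, observed_at, value in observations:
        # Keep the record only if it is newer than what we have seen so far.
        if key not in latest or observed_at > latest[key][0]:
            latest[key] = (observed_at, value)
    return {key: value for key, (_, value) in latest.items()}


# Hypothetical observations of a `user.name` feature.
observations = [
    (1, datetime(2024, 10, 12), "Ada"),
    (1, datetime(2024, 10, 15), "Ada L."),
    (2, datetime(2024, 10, 13), "Grace"),
]
print(latest_per_key(observations))
```

Each primary key yields exactly one row, and older observations for the same key are dropped.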
Chalk supports a number of clients for offline queries. In this section, our examples use our Python client, which integrates nicely with Jupyter notebooks.
The example below issues an offline query through the Python client. The commented-out arguments show the optional parameters, grouped by purpose.
from chalk.client import ChalkClient
from datetime import datetime

client = ChalkClient()
client.offline_query(
    input={'user.id': [1, 2, 3, 4]},  # Input
    output=['user.name'],  # Output
    # tags=['test'],  # Environment
    # branch='branch',
    # recompute_features=True,  # Run Resolvers
    # run_asynchronously=True,  # Run Configuration
    # max_samples=10,
    # lower_bound=datetime(2024, 10, 12),  # Bounds
    # upper_bound=datetime(2024, 10, 20),
)
Offline queries return Chalk Datasets, which we cover in more detail in the next section. At a high level, Chalk Datasets are flexible wrappers for the results of your offline query. They can be converted to pandas or polars DataFrames for ease of use in downstream ML tasks.
Offline queries can receive input from a DataFrame, mapping, or SQL query. Regardless of the format, the primary key feature must be included.
The input parameter accepts chalk.DataFrame, pandas.DataFrame, or a mapping. DataFrames should include one column for each known feature in the input. Mappings should map from each feature to a list of its values:
input={
    User.id: ['id1', 'id2'],
    User.age: [23, 40],
}
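If your inputs start out as a list of per-entity records, they need to be pivoted into the column-oriented mapping described above. The following is a minimal sketch, assuming string feature names such as 'user.id' as keys (as in the first example on this page). `rows_to_input` is a hypothetical helper, not part of the Chalk client.

```python
def rows_to_input(rows):
    """Pivot a list of per-row dicts into {feature_name: [values, ...]}."""
    if not rows:
        return {}
    columns = list(rows[0])
    for row in rows:
        # Every row must supply the same features so the lists stay aligned.
        if set(row) != set(columns):
            raise ValueError(f"row {row!r} does not match columns {columns!r}")
    return {name: [row[name] for row in rows] for name in columns}


rows = [
    {"user.id": "id1", "user.age": 23},
    {"user.id": "id2", "user.age": 40},
]
input_mapping = rows_to_input(rows)
print(input_mapping)
```

The resulting mapping has one list per feature, with position i in every list describing the same input row.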
offline_query also takes spine_sql_query as an alternative input parameter for retrieving feature values from your offline data store.
Spine SQL queries are recommended in the following scenarios:
Spine SQL queries accept features from multiple feature namespaces as columns of the result. Each column must either correspond to an existing feature or be included in the output list. For each referenced feature namespace, the feature namespace’s primary key must be included as a column.
The “ts” column is always interpreted as the query execution time for the row. See our documentation on temporal consistency for more details.
# This spine_sql_query queries the table `transactions` in the Snowflake offline store instance.
# `chalk_client` is a ChalkClient instance and `now` is a datetime, e.g. datetime.now().
output = chalk_client.offline_query(
    spine_sql_query=f"""
        SELECT
            t.txn_time AS ts,
            t.seller_id AS "seller.id",
            t.buyer_id AS "buyer.id",
            t.amount AS "txn.amount",
            t.payment_type AS "txn.payment_type"
        FROM transactions AS t
        -- The interpolated timestamp is quoted so that it is a valid SQL literal.
        WHERE t.update_at >= '{(now - timedelta(days=30)).isoformat()}'
    """,
    output=[
        Seller.id,
        Buyer.id,
        Buyer.account_created_date,
        Txn.payment_type,
        # Computed in the seller namespace from the 'seller.id' spine feature.
        Seller.recent_transactions_volume,
        # Computed in the buyer namespace from the 'buyer.id' spine feature.
        Buyer.total_spent_last_30d,
        # Passed through from the SQL query
        Txn.amount,
    ],
)
Alternatively, timestamps can be passed in the input_times argument.
input={
    User.id: ['id1', 'id1'],
},
input_times=[datetime.now() - timedelta(days=1), datetime.now() - timedelta(days=2)],
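Pairing each input row with a timestamp means the same primary key can be evaluated as of different moments. The sketch below illustrates that idea in plain Python under simplifying assumptions: a hypothetical in-memory history of (observation time, value) pairs for a single key, and an as-of lookup that returns the latest value observed at or before each input time. It is only a mental model, not how Chalk resolves features.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def value_as_of(history, at):
    """Return the latest value observed at or before `at`, or None."""
    times = [observed_at for observed_at, _ in history]
    # bisect_right gives the number of observations with time <= at.
    index = bisect_right(times, at)
    return history[index - 1][1] if index else None


now = datetime(2024, 10, 20, 12, 0)
# Hypothetical score history for user 'id1', sorted by observation time.
history = [
    (now - timedelta(days=3), 610),
    (now - timedelta(days=1, hours=6), 640),
]
input_times = [now - timedelta(days=1), now - timedelta(days=2)]
results = [value_as_of(history, t) for t in input_times]
print(results)
```

The two rows share a primary key but see different values, because the second input time precedes the later observation.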
The output argument lists the features to sample.
output=[
    User.returned_transactions_last_60,
    User.user_account_name_match_score,
    User.socure_score,
    User.identity.has_verified_phone,
    User.identity.is_voip_phone,
    User.identity.account_age_days,
    User.identity.email_age,
]
Users can request that certain features be recomputed by resolvers at query time instead of sampled from the offline store. For more information, see recompute_features.
In some cases, users may not have a list of primary keys to sample with and would instead like to see results within a period of time. In that case, leave the input argument empty and supply a lower bound and an upper bound along with the requested output features.
dataset: Dataset = ChalkClient().offline_query(
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    lower_bound=datetime.now() - timedelta(days=7),
    upper_bound=datetime.now(),
)
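The example above calls datetime.now() twice, so the two bounds are taken at slightly different instants. The following is a small sketch of one way to derive both bounds from a single reference time. `last_n_days` is a hypothetical helper, and the resulting dict is only meant to be unpacked into the lower_bound and upper_bound arguments shown above.

```python
from datetime import datetime, timedelta


def last_n_days(days, now=None):
    """Return lower/upper bound keyword arguments covering the last `days` days."""
    upper = now if now is not None else datetime.now()
    return {"lower_bound": upper - timedelta(days=days), "upper_bound": upper}


# A fixed reference time keeps the window reproducible across reruns.
bounds = last_n_days(7, now=datetime(2024, 10, 20))
print(bounds["lower_bound"], bounds["upper_bound"])

# The dict can then be unpacked into the query call, for example:
# ChalkClient().offline_query(output=[User.id], **bounds)
```

Pinning the reference time also makes it easier to regenerate the same dataset window later.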
The user can specify tags, environment, or branch as parameters to offline_query in the same fashion as for online queries.