Offline Queries

Offline queries pull data from the offline store or calculate features through resolvers that are marked as offline.

Offline queries can also execute online resolvers if no offline resolver is available for a requested feature.

In an offline query you can request features for multiple entities at distinct time points. By default, an offline query returns a row for each primary key containing the most recent computed value for each requested output feature. The main use case for offline queries is creating datasets.
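
The default "most recent value per primary key" behaviour can be pictured with a small, self-contained sketch. It only illustrates the idea and is not how Chalk implements it; the `Observation` tuple, the `latest_per_key` function, and the sample history are all hypothetical.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

# (primary key, time the value was computed, feature value); hypothetical shape.
Observation = Tuple[str, datetime, float]


def latest_per_key(observations: Iterable[Observation]) -> Dict[str, float]:
    """Keep only the most recently computed value for each primary key."""
    latest: Dict[str, Tuple[datetime, float]] = {}
    for key, observed_at, value in observations:
        # Replace the stored value only when this observation is newer.
        if key not in latest or observed_at > latest[key][0]:
            latest[key] = (observed_at, value)
    return {key: value for key, (_, value) in latest.items()}


history = [
    ("id1", datetime(2024, 10, 12), 0.2),
    ("id1", datetime(2024, 10, 18), 0.7),
    ("id2", datetime(2024, 10, 15), 0.4),
]
print(latest_per_key(history))  # {'id1': 0.7, 'id2': 0.4}
```

Each primary key ends up with exactly one row, which matches the shape described above.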

Chalk supports a number of clients for offline queries. In this section, our examples use our Python client, which integrates nicely with Jupyter notebooks.

Making offline queries

As mentioned earlier, offline queries can be made through one of Chalk’s API clients.

from chalk.client import ChalkClient
from datetime import datetime

client = ChalkClient()
client.offline_query(
  input={'user.id': [1,2,3,4]},         # Input
  output=['user.name'],                 # Output
  # tags=['test'],                      # Environment
  # branch='branch',
  # recompute_features=True,            # Run Resolvers
  # run_asynchronously=True,            # Run Configuration
  # max_samples = 10,
  # lower_bound=datetime(2024, 10, 12), # Bounds
  # upper_bound=datetime(2024, 10, 20),
)

Offline queries return Chalk Datasets, which we cover in more detail in the next section. At a high level, Chalk Datasets are flexible wrappers for the results of your offline query. They can be converted to pandas or polars DataFrames for ease of use in downstream ML tasks.

Input

Offline queries can receive input from a DataFrame, mapping, or SQL query. Regardless of the format, the primary key feature must be included.

DataFrames and mappings as input

The input parameter accepts chalk.DataFrame, pandas.DataFrame, or a mapping. DataFrames should include one column for each known feature in the input. Mappings should map from each feature to a list of its values:

input={
    User.id: ['id1', 'id2'],
    User.age: [23, 40]
}
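
A mapping is read column by column, so the value lists should line up row by row. The sketch below checks this before a query is sent. It is not part of the Chalk client: `check_input_mapping` and its `primary_key` argument are hypothetical, and features are written as strings (as in `'user.id'` earlier) so the snippet runs without a Chalk project.

```python
from typing import Any, Mapping, Sequence


def check_input_mapping(
    mapping: Mapping[str, Sequence[Any]],
    primary_key: str,
) -> int:
    """Return the number of input rows, or raise ValueError if the mapping is malformed.

    Hypothetical helper, not part of the Chalk client.
    """
    # The primary key feature must always be present in the input.
    if primary_key not in mapping:
        raise ValueError(f"primary key feature {primary_key!r} is missing")
    # Every feature needs one value per row, so all lists must share a length.
    lengths = {name: len(values) for name, values in mapping.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"value lists differ in length: {lengths}")
    return lengths[primary_key]


rows = check_input_mapping(
    {"user.id": ["id1", "id2"], "user.age": [23, 40]},
    primary_key="user.id",
)
print(rows)  # 2
```

Running a check like this locally can surface a malformed input before the query is submitted.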

SQL queries as input

offline_query also takes spine_sql_query as an alternative input parameter for retrieving feature values from your offline data store.

Spine SQL queries are recommended in the following scenarios:

  1. You want to retrieve data for multiple rows at once. Chalk will compute an efficient query plan for loading multiple rows of data at once. Chalk will also reuse data between rows where appropriate.
  2. You want to query from Chalk as your offline data store. You can reduce unnecessary back-and-forth requests by having Chalk execute the SQL query and handle the result rows directly.
  3. You want to request features from multiple feature namespaces for each row of output.

Spine SQL queries accept features from multiple feature namespaces as columns of the result. Each column must either correspond to an existing feature or be included in the output list. For each referenced feature namespace, the feature namespace’s primary key must be included as a column.

The “ts” column is always interpreted as the query execution time for the row. See our documentation on temporal consistency for more details.

from datetime import datetime, timedelta

# Only include transactions updated in the last 30 days.
cutoff = datetime.now() - timedelta(days=30)

dataset = chalk_client.offline_query(
    spine_sql_query=f"""
        SELECT
            t.txn_time AS ts,
            t.seller_id AS "seller.id",
            t.buyer_id AS "buyer.id",
            t.amount AS "txn.amount",
            t.payment_type AS "txn.payment_type"
        FROM transactions AS t
        WHERE t.update_at >= '{cutoff:%Y-%m-%d %H:%M:%S}'
    """,
    output=[
        Seller.id,
        Buyer.id,
        Buyer.account_created_date,
        Txn.payment_type,
        # Computed in the seller namespace from the 'seller.id' spine feature.
        Seller.recent_transactions_volume,
        # Computed in the buyer namespace from the 'buyer.id' spine feature.
        Buyer.total_spent_last_30d,
        # Passed through from the SQL query.
        Txn.amount,
    ],
)
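
When a spine query is assembled with an f-string, a Python datetime has to be rendered as a quoted SQL literal, otherwise the interpolated value is not valid SQL. A minimal sketch of one way to do that follows. The `spine_sql_since` helper is hypothetical, the table and column names are copied from the example above, and the literal format is an assumption that may need adjusting for your warehouse's SQL dialect.

```python
from datetime import datetime, timedelta


def spine_sql_since(cutoff: datetime) -> str:
    """Build a spine query for transactions updated at or after `cutoff`.

    Hypothetical helper; table and column names follow the example above.
    """
    # Render the datetime as a quoted literal (assumed format, dialect-dependent).
    literal = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return f"""
        SELECT
            t.txn_time AS ts,
            t.seller_id AS "seller.id",
            t.buyer_id AS "buyer.id",
            t.amount AS "txn.amount"
        FROM transactions AS t
        WHERE t.update_at >= '{literal}'
    """


sql = spine_sql_since(datetime(2024, 10, 12) - timedelta(days=30))
print(sql)
```

The cutoff here is a value computed in code, not user input, so string formatting is acceptable; values that come from users should not be interpolated into SQL this way.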

Input times

Timestamps can also be passed in the input_times argument instead.

input={
    User.id: ['id1', 'id1'],
},
input_times=[datetime.now() - timedelta(days=1), datetime.now() - timedelta(days=2)],
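
To observe several entities at several points in time, the primary key list and the timestamp list can be built together so that position i in one corresponds to position i in the other. The sketch below does this with a cross product. `expand_observations` is a hypothetical helper, the row-by-row alignment of `input` and `input_times` is inferred from the example above, and the feature is written as the string `'user.id'` so the snippet runs on its own.

```python
from datetime import datetime, timedelta
from itertools import product
from typing import Dict, List, Sequence, Tuple


def expand_observations(
    ids: Sequence[str],
    times: Sequence[datetime],
    pk_feature: str = "user.id",
) -> Tuple[Dict[str, List[str]], List[datetime]]:
    """Pair every id with every timestamp, returning aligned input and times.

    Hypothetical helper, not part of the Chalk client.
    """
    pairs = list(product(ids, times))
    query_input = {pk_feature: [entity_id for entity_id, _ in pairs]}
    input_times = [observed_at for _, observed_at in pairs]
    return query_input, input_times


now = datetime(2024, 10, 20)
query_input, input_times = expand_observations(
    ["id1", "id2"],
    [now - timedelta(days=1), now - timedelta(days=2)],
)
print(query_input)  # {'user.id': ['id1', 'id1', 'id2', 'id2']}
print(len(input_times))  # 4
```

The two return values would then be passed as the `input` and `input_times` arguments.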

Output

The output argument lists the features to sample.

output=[
    User.returned_transactions_last_60,
    User.user_account_name_match_score,
    User.socure_score,
    User.identity.has_verified_phone,
    User.identity.is_voip_phone,
    User.identity.account_age_days,
    User.identity.email_age,
]

Recompute features

Users can request that certain features be recomputed by resolvers at query time instead of sampled from the offline store. For more information, see recompute_features.

Timebounds

In some cases, users may not have a list of primary keys to sample and would instead like to see results within a period of time. They can then leave the input argument empty and supply lower_bound and upper_bound along with the requested output features.

dataset: Dataset = ChalkClient().offline_query(
    output=[
        User.id,
        User.fullname,
        User.email,
        User.name_email_match_score,
    ],
    lower_bound=datetime.now() - timedelta(days=7),
    upper_bound=datetime.now(),
)
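
Calling `datetime.now()` separately for each bound yields two slightly different instants. A small sketch that derives both bounds from a single reference time is shown below. `trailing_window` is a hypothetical helper, and the use of timezone-aware UTC datetimes is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def trailing_window(
    days: int,
    now: Optional[datetime] = None,
) -> Tuple[datetime, datetime]:
    """Return (lower_bound, upper_bound) covering the last `days` days.

    Hypothetical helper, not part of the Chalk client.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    # Use one reference instant so the window is exactly `days` long.
    upper = now if now is not None else datetime.now(timezone.utc)
    return upper - timedelta(days=days), upper


lower_bound, upper_bound = trailing_window(
    7, now=datetime(2024, 10, 20, tzinfo=timezone.utc)
)
print(lower_bound.isoformat())  # 2024-10-13T00:00:00+00:00
print(upper_bound.isoformat())  # 2024-10-20T00:00:00+00:00
```

The resulting pair would be passed as the `lower_bound` and `upper_bound` arguments.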

Environment

Users can specify tags, environment, or branch as parameters to offline_query, in the same way as for online queries.