Offline Queries
Persist and evolve offline queries over time
The Chalk Dataset class governs metadata related to offline queries, supports revisions to queries over time, and enables the easy retrieval of data from the cloud.
Dataset instances are obtained by calling ChalkClient.offline_query(), which computes feature values from the offline store.
If inputs are given, the method returns the values corresponding to those inputs.
Otherwise, the method returns a random sample whose size is bounded by the max_samples parameter.
from datetime import datetime

from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
input={
User.id: uids,
User.ts: [at] * len(uids),
},
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset_name='my_dataset'
)
sample_dataset: Dataset = ChalkClient().offline_query(
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
max_samples=10,
dataset_name='my_sample'
)
Here, we attach a unique name to the Dataset.
Whenever we send additional queries with the same name, a new DatasetRevision instance is created and attached to the existing dataset.
If a dataset_name is not given, the output data won't be retrievable beyond the current session.
A dataset's revisions can be inspected via Dataset.revisions:
they hold useful metadata about the offline query job and the data itself.
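As a sketch, a named dataset can later be retrieved in a new session without recomputing anything. This assumes a deployed Chalk environment and that ChalkClient.get_dataset is available in your client version:

```python
from chalk.client import ChalkClient, Dataset

# Retrieve the dataset created earlier by name; no recomputation occurs.
dataset: Dataset = ChalkClient().get_dataset(dataset_name='my_dataset')
print(dataset.data_as_pandas)
```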
Be sure to check Dataset.errors for any errors raised while submitting the query.
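For example, a minimal error check after submitting a query might look like the following sketch (assuming a deployed Chalk environment and the User feature class from the examples above):

```python
from chalk.client import ChalkClient, Dataset

dataset: Dataset = ChalkClient().offline_query(
    output=[User.id, User.email],
    max_samples=10,
    dataset_name='my_sample',
)

# dataset.errors is empty on success; otherwise it holds error values
# describing problems encountered while submitting or running the query.
if dataset.errors:
    for error in dataset.errors:
        print(error)
```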
Since offline queries are not realtime, the Dataset instance returned is not guaranteed to have the outputs of the query instantaneously, so loading the data may take some time.
The data can be accessed programmatically by calling Dataset.data_as_pandas, Dataset.data_as_polars, or Dataset.data_as_dataframe.
If the offline query job is still running, the Dataset will poll the engine until the results are complete.
from datetime import datetime

import pandas as pd
import polars as pl

import chalk
from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
input={
User.id: uids,
User.ts: [at] * len(uids),
},
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset_name='my_dataset'
)
pandas_df: pd.DataFrame = dataset.data_as_pandas
polars_df: pl.LazyFrame = dataset.data_as_polars
chalk_df: chalk.features.DataFrame = dataset.data_as_dataframe
The file outputs of the query themselves can also be downloaded to a specific directory.
from datetime import datetime

from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
input={
User.id: uids,
User.ts: [at] * len(uids),
},
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset_name='my_dataset'
)
dataset.download_data('my_directory')
By default, Dataset instances fetch the output data from their most recent revision.
A specific DatasetRevision's output data can be fetched using the same methods.
from datetime import datetime

from chalk.client import ChalkClient, Dataset

uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
input={
User.id: uids,
User.ts: [at] * len(uids),
},
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset_name='my_dataset'
)
for revision in dataset.revisions:
    print(revision.data_as_pandas)
Datasets expose a recompute method that lets users see the results of updates to resolvers or features in the context of an existing dataset. recompute takes a list of features as an argument; those features are recomputed, using the remaining features in the dataset as input. Any other required input features are sampled from the offline store.
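A minimal sketch of this workflow, assuming a deployed Chalk environment, the User feature class from the examples above, and that recompute records its results as a new revision on the same dataset:

```python
from chalk.client import ChalkClient, Dataset

# Retrieve the previously named dataset (get_dataset is assumed to be
# available in your client version).
dataset: Dataset = ChalkClient().get_dataset(dataset_name='my_dataset')

# Recompute only the match score after updating its resolver. The other
# features already in the dataset (e.g. User.fullname, User.email) serve
# as inputs; any other required inputs are sampled from the offline store.
dataset.recompute(features=[User.name_email_match_score])

# Fetch the recomputed values from the most recent revision.
print(dataset.data_as_pandas)
```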