Updates to Chalk!
You can now view the Kubernetes pods created by each deployment in the dashboard along with additional details like the pod states and the resources requested by each pod.
We’ve added the array_agg function to chalk.functions to help you resolve list features with underscore expressions. For example, you can now express the following feature to aggregate all categories of videos watched for a user.
import chalk.functions as F
from chalk import DataFrame
from chalk.features import _, features
@features
class VideoInteraction:
id: str
video_url: str
video_category: str
user_id: "User.id"
@features
class User:
id: str
videos_watched: DataFrame[VideoInteraction]
all_watched_video_categories: list[str] = F.array_agg(_.videos_watched[_.video_category])
To see all of the chalk.functions that you can use in underscore expressions, see our API documentation.
Users can now use the chalk usage commands to view usage information for their projects and environments. If you have any questions, please reach out to the Chalk team.
We now provide an idempotency key parameter for triggering resolver runs so that you can ensure that only one job will be kicked off per idempotency key provided.
The ChalkClient now has a check function that enables you to run a query and check whether the query outputs match your expected outputs. This function should be used with pytest for integration testing. To read more about different methods and best practices for integration testing, see our integration test docs.
This week, we’ve added the mathematical functions floor, ceil, and abs to chalk.functions, along with the logical functions when, then, otherwise, and is_null. We’ve also added the haversine function for computing the Haversine distance between two points on Earth given their latitude and longitude. These functions can be used in underscore expressions to define features with code that can be statically compiled to C++ for faster execution. See the full list of functions you can use in underscore expressions in our API docs.
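Here is a minimal sketch of how these functions might be combined in underscore expressions. The feature class, the keyword arguments to F.haversine, and the F.when(...).then(...).otherwise(...) chain are illustrative assumptions rather than documented signatures; check the API docs for the exact forms.
import chalk.functions as F
from chalk.features import _, features

@features
class Trip:
    id: str
    start_lat: float
    start_lon: float
    end_lat: float
    end_lon: float
    raw_fare: float
    # Round the fare down and take its absolute value.
    fare_floor: float = F.floor(_.raw_fare)
    fare_magnitude: float = F.abs(_.raw_fare)
    # Haversine distance between the start and end coordinates
    # (argument names here are an assumption).
    distance_km: float = F.haversine(
        lat1=_.start_lat, lon1=_.start_lon, lat2=_.end_lat, lon2=_.end_lon
    )
    # Conditional expression built from when/then/otherwise.
    is_long_trip: bool = F.when(_.distance_km > 100).then(True).otherwise(False)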
In the dashboard, users can now view the P50, P75, P95, and P99 latencies for resolvers in the table under the Resolver tab of the menu. You can also customize which columns are displayed in the table by clicking the gear icon in the top left corner of the table.
In addition, we’ve added a SQL Explorer for examining resolver output for queries that are run with the store_plan_stages=True parameter.
You can now use the chalk healthcheck command in the CLI to view information on the health of Chalk’s API server and its services. The healthcheck provides information for the API server based on the active environment and project. To read more about the healthcheck command, see the CLI documentation.
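For example, from a working directory where your Chalk project and active environment are configured, the command takes no required arguments:
chalk healthcheck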
When running an asynchronous offline query, you can now specify the num_shards and num_workers parameters to allow for more granular control over the parallelization of your query execution. To see all of the offline query options, check out the offline query documentation.
In addition, offline query progress reporting now specifies progress by shard, giving developers more insight into where their offline query is in its execution.
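As a rough sketch (the feature names here are hypothetical), the sharding parameters are passed directly to offline_query alongside run_asynchronously:
from chalk.client import ChalkClient

client = ChalkClient()
dataset = client.offline_query(
    input={"user.id": list(range(100_000))},
    output=["user.num_interactions_l7d"],
    run_asynchronously=True,  # sharding applies to asynchronous offline queries
    num_shards=8,    # split the input into 8 shards
    num_workers=4,   # process up to 4 shards in parallel
)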
You can now default to using the name of your current Git branch when developing using the ChalkClient. For example, if you have checked out a branch named my-very-own-branch, you can now set ChalkClient(branch=True) and all of your client calls will be directed at my-very-own-branch. To read more about how to use the ChalkClient, see our API documentation.
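A minimal sketch of this workflow (the feature names are hypothetical):
from chalk.client import ChalkClient

# With my-very-own-branch checked out, branch=True picks up the branch
# name from git and directs all client calls at that branch deployment.
client = ChalkClient(branch=True)
client.query(input={"user.id": "u_123"}, output=["user.email"])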
We’ve added more functions to chalk.functions that can be used in underscore expressions. You can now use regexp_like, regexp_extract, split_part, and regexp_extract_all to do regular expression matching, and use url_extract_host, url_extract_path, and url_extract_protocol to parse URLs. In addition, we’ve added helpful logical functions like if_then_else, map_dict, and cast to broaden the span of features that you can define using underscore expressions. To read more about all of our functions, check out our API documentation.
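Here is an illustrative sketch combining a few of these functions; the feature class, the regular expression, and the exact argument order are assumptions rather than documented signatures:
import chalk.functions as F
from chalk.features import _, features

@features
class WebEvent:
    id: str
    url: str
    user_agent: str
    # Pieces of the URL parsed with the new URL helpers.
    host: str = F.url_extract_host(_.url)
    protocol: str = F.url_extract_protocol(_.url)
    # Simple regular-expression match on the user agent string.
    looks_like_bot: bool = F.regexp_like(_.user_agent, "(?i)bot|crawler|spider")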
We now provide more detailed build logs for deployments in AWS environments in the dashboard!
We’ve added a new function chalk.functions.sagemaker_predict that allows you to run predictions against a SageMaker endpoint to resolve features. Read more about how to define a SageMaker endpoint, encode your input data, and run predictions in our SageMaker tutorial.
In addition to being able to make SageMaker calls, underscore expressions now support a variety of new functions. With these functions imported from chalk.functions, you can perform encoding, decoding, math, datetime manipulation, string manipulation, and more! For example, say you have a Transaction feature class, where you make a SageMaker call to enrich the transaction data and provide a label for the transaction, and you parse this label for other features. You can now define all of these features related to transaction enrichment using underscore expressions and Chalk functions in the feature definition:
from datetime import date
import chalk.functions as F
from chalk.features import _, features, Primary
@features
class Transaction:
id: Primary[str]
amount: float
date: date
day: int = F.day_of_year(_.date)
month: int = F.month_of_year(_.date)
sagemaker_input_data: bytes = F.string_to_bytes(_.id, encoding="utf-8")
transaction_enrichment_label: bytes = F.sagemaker_predict(
_.sagemaker_input_data,
endpoint="transaction-enrichment-model_2.0.2024_10_28",
target_model="model_v2.tar.gz",
target_variant="production_variant_3"
)
transaction_enrichment_label_str: str = F.bytes_to_string(_.transaction_enrichment_label, encoding="utf-8")
is_rent: bool = F.like(_.transaction_enrichment_label_str, "%rent%")
is_purchase: bool = F.like(_.transaction_enrichment_label_str, "%purchase%")
You can now reference other windowed aggregations in your windowed aggregation expressions. To read more about how to define your windowed aggregations, see our example here.
We’ve updated the Usage Dashboard with a new view under the Pod Resources tab that allows you to view CPU and storage requests by pod as grouped by cluster, environment, namespace, and service! If you have any questions about the usage dashboard, please reach out to the Chalk team.
From chalkpy==2.55.0, Chalk is dropping support for Python 3.8, which has reached end-of-life. If you are still using Python 3.8, please upgrade to Python 3.9 or higher.
We’ve enabled support for using Pub/Sub as a streaming source. Read more about how to use Pub/Sub as a streaming source here.
You can automatically load offline query outputs to the online and offline store using the boolean parameters store_online and store_offline. Below is an example of how to use these parameters.
from chalk.client import ChalkClient
client = ChalkClient()
ds = client.offline_query(
input={"user.id": [1, 2, 3, 4, 5]},
output=["user.num_interactions_l7d", "user.num_interactions_l30d", "user.num_interactions_l90d"],
store_online=True,
store_offline=True
)
Customers running gRPC servers can now run SQL queries on the dataset outputs of online and offline queries in the dashboard. To enable this feature for your deployment, please reach out to the team.
We’ve updated our color scheme in the dashboard to more clearly differentiate between successes and failures in metrics graphs!
Customers can now run SQL queries on dataset outputs in the dashboard. To use this feature, navigate to the Datasets page in the menu, select a dataset, and click on the Output Explorer tab.
Last week we enabled the option to decide whether to persist null values for features in Redis lightning online stores, and this week we have enabled this feature in DynamoDB online stores. By default, null values are persisted in the online store for features defined as Optional, but you can set cache_nulls=False in the feature method to evict null values. Read more about how to use the cache_nulls parameter here.
You can set cloud resource configurations for your environment by navigating to Settings > Resources in the dashboard. In addition to specifying resource configurations for resource groups like instance counts and CPU, you can now also set environment variables and other settings like Kubernetes Node Selectors. The Kubernetes Node Selector enables you to specify the machine family you would like to use for your deployment. For example, this would map to EC2 Instance Types for AWS deployments or Compute Engine Machine Families for GCP deployments. If you have any questions about how to use any of these settings in the configuration page, please reach out to the team.
Underscore expressions now support datetime subtraction and the use of a new library function chalk.functions.total_seconds. This allows you to compute the number of seconds in a time duration and define more complex time interval calculations using performant underscore expressions.
For example, to define a feature that computes the difference between two date features in days and weeks, we can use chalk.functions.total_seconds and underscore date expressions together.
import chalk.functions as F
from chalk.features import _, features, Primary
from datetime import date
@features
class User:
id: Primary[str]
created_at: date
last_activity: date
days_since_last_activity: float = F.total_seconds(date.today() - _.last_activity) / (60 * 60 * 24)
num_weeks_active: float = F.total_seconds(_.last_activity - _.created_at) / (60 * 60 * 24 * 7)
You can now select whether to persist null values for features in the Redis lightning online store using the cache_nulls parameter in the feature method. By default, null values are persisted in the online store for features defined as Optional. If you set cache_nulls=False, null values will not be persisted in the online store.
from typing import Optional
from chalk import feature
from chalk.features import features, Primary
@features
class RestaurantRating:
id: Primary[str]
cleanliness_score: Optional[float] = feature(cache_nulls=False) # null values will not be persisted
service_score: Optional[float] = feature(cache_nulls=True) # null values will be persisted. This is the default behavior.
overall_score: float # null values are not persisted for required features
Customers running the gRPC server can now reach out to enable feature value metrics. Feature value metrics include the number of observations, number of unique values, and percentage of null values over all queries, as well as the running average and maximum of features observed. Please reach out if you’d like to enable feature value metrics.
Additionally, the feature table in the dashboard has been updated to allow for customization of columns displayed, which enables viewing request counts over multiple time ranges in the same view.
chalk.functions now offers a cosine_similarity function:
import chalk.functions as F
from chalk.features import _, embedding, features, Vector
@features
class Shopper:
id: str
preferences_embedding: Vector[1536]
@features
class Product:
id: str
description_embedding: Vector[1536]
@features
class ShopperProduct:
id: str
shopper_id: Shopper.id
shopper: Shopper
product_id: Product.id
product: Product
similarity: float = F.cosine_similarity(_.shopper.preferences_embedding, _.product.description_embedding)
Cosine similarity is useful when handling vector embeddings, which are often used when analyzing unstructured text. You can also use embedding to compute vector embeddings with Chalk.
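For completeness, here is a sketch of how an embedding feature might be declared; the provider and model values are illustrative assumptions, not a recommendation:
from chalk.features import embedding, features, Vector

@features
class Shopper:
    id: str
    preferences_text: str
    # Embedding computed by Chalk from the raw text feature.
    preferences_embedding: Vector[1536] = embedding(
        input=lambda: Shopper.preferences_text,
        provider="openai",
        model="text-embedding-ada-002",
    )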
When looking at an offline query run in the dashboard, you’ll now find a new Metrics tab showing query metadata, CPU utilization, and memory utilization.
We have a new offline query configuration option for recomputing features only when they are not already available in the offline store. This option is useful for workloads with computationally expensive features that cannot easily be recomputed. Please reach out if you’d like to try this feature.
Sometimes, a SQL resolver may fail to retrieve data due to temporary unavailability. We’ve added new options for configuring the number of retry attempts a resolver may make (and how long it should wait between attempts). If you’re interested in trying out this new functionality early, please let the team know.
When creating has-one relationships, you can set the primary key of the child feature class to the primary key of the parent feature class. For example, you may model an InsurancePolicy feature class as belonging to exactly one user by setting InsurancePolicy.id’s type to Primary[User.id].
Now, we’ve updated Chalk so that you can chain more of these relationships together. For example, an InsurancePolicy feature class may have an associated InsuranceApplication. The InsuranceApplication may also have an associated CreditReport. Chalk now allows chaining an arbitrary number of has-one relationships. Chalk will also validate these relationships to ensure there are no circular dependencies.
Here’s an example where we have features describing a system in which a user has one insurance policy, each policy has one submitted application, and each application has one credit report:
from chalk import Primary
from chalk.features import features
@features
class User:
id: str
# :tags: pii
ssn: str
policy: "InsurancePolicy"
@features
class InsurancePolicy:
id: Primary[User.id]
user: User
application: "InsuranceApplication"
@features
class InsuranceApplication:
id: Primary[InsurancePolicy.id]
stated_income: float
# For the sake of illustrating has-one relationships,
# we're assuming exactly one credit report per
# application, which may not be realistic. A has-many
# relationship may be more accurate here.
credit_report: "CreditReport"
@features
class CreditReport:
id: Primary[InsuranceApplication.id]
fico_score: int
application: InsuranceApplication
To query for a user’s credit report, you would write:
client.query(
inputs={User.id: "123"},
output=[User.policy.application.credit_report],
)
To write a resolver for one of the dependent feature classes here, such as CreditReport.fico_score, you would still reference the relevant feature class by itself:
@online
def get_fico_score(id: CreditReport.id) -> CreditReport.fico_score:
...
As an aside, if your resolver depends on features from other feature classes, such as User.ssn, we instead recommend joining those two feature classes directly for clarity (which was possible prior to this changelog entry):
from chalk import Primary
from chalk.features import features, has_one
@features
class User:
id: str
# :tags: pii
ssn: str
policy: "InsurancePolicy"
credit_report: "CreditReport" = has_one(lambda: User.id == CreditReport.id)
# ... the rest of the feature classes
@online
def get_fico_score(id: User.id, ssn: User.ssn) -> User.credit_report.fico_score:
...
When you view Users in the Chalk settings page, you will now find a menu for viewing the roles associated with each user, whether those roles are granted directly or via SCIM.
We have shipped a new UI for the Features and Resolvers sections of the dashboard!
The new UI has tables with compact filtering and expanded functionality. You can now filter and sort by various resolver and feature attributes! The tables also provide column resizing for convenient exploration of the feature catalog.
The features table now includes request counts from the last 5 minutes up to the last 180 days, has built-in sorting, and has a Features as CSV button to download all the feature attributes in your table as a CSV for further analysis.
The new chalk.functions module contains several helper functions for feature computation. For example, if you have a feature representing a GZIP-compressed raw value, you can use gunzip with an underscore reference to create an unzipped feature. The full list of available functions can be found at the bottom of our underscore expression documentation.
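As a small sketch (the feature class is hypothetical), a compressed payload can be unzipped directly in the feature definition:
import chalk.functions as F
from chalk.features import _, features

@features
class Document:
    id: str
    # GZIP-compressed payload ingested from an upstream source.
    compressed_body: bytes
    # Decompressed payload, defined with an underscore reference.
    body: bytes = F.gunzip(_.compressed_body)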
You can now define features with JSON as the type after importing JSON from the chalk module. You can then reference the JSON feature in resolver and query definitions. You can also retrieve scalar values from JSON features using the json_value function.
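A minimal sketch of the pattern; the feature class and the path argument passed to json_value are assumptions for illustration, not documented syntax:
import chalk.functions as F
from chalk import JSON
from chalk.features import _, features

@features
class Profile:
    id: str
    # Raw JSON blob returned by an upstream API.
    raw: JSON
    # Scalar value extracted from the JSON feature.
    plan_name: str = F.json_value(_.raw, "$.subscription.plan")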
By default, Chalk caches all feature values, including null. To prevent Chalk from caching null values, use the feature method and set cache_nulls to False.
We built a way to statically interpret Python resolvers to identify ones that are eligible for C++ execution, which has faster performance. For now, resolvers are eligible if they do simple arithmetic and logical expressions. If you’re interested in learning more and seeing whether these new query planner options would apply to your codebase, please reach out!
We have a new tutorial for using Chalk with SageMaker available now. In the tutorial, we show how to use Chalk to generate training datasets from within a SageMaker pipeline for model training and evaluation.
In the August 19 changelog entry, we announced NamedQuery, a tool for naming your queries so that you can execute them without writing out the full query definition.
This week, we’ve updated the dashboard’s feature catalog so that it shows which named queries reference a given feature as input or output.
We added a new Aggregations page to the dashboard where you can see the results of aggregate backfill commands. Check it out to see what resolvers were run for a backfill, the backfill’s status, and other details that will help you drill down to investigate performance. For more details on aggregate backfills, see our documentation on managing windowed aggregations.
Instead of writing out the full definition of your query each time you want to run it, you can now register a name for your query and reference it by the name!
Here’s an example of a NamedQuery:
from chalk import NamedQuery
from src.feature_sets import Book, Author
NamedQuery(
name="book_key_information",
input=[Book.id],
output=[
Book.id,
Book.title,
Book.author.name,
Book.year,
Book.short_description
],
tags=["team:analytics"],
staleness={
Book.short_description: "0s"
},
owner="mary.shelley@aol.com",
description=(
"Return a condensed view of a book, including its title, author, "
"year, and short description."
)
)
After applying this code, you can execute this query by its name:
chalk query --in book.id=1 --query-name book_key_information
To see all named queries defined in your current active deployment, use chalk named-query list.
As Shakespeare once wrote, “What’s in a named query? That which we call a query by any other name would execute just as quickly.”
Previously, you could only reference one feature namespace in your queries. Now you can request features from multiple feature namespaces. For example, here’s a query for a specific customer and merchant:
client.query(
input={
Customer.id: 12345,
Merchant.id: 98765,
},
output=[Customer, Merchant],
)
The resources page of the dashboard now shows the allocatable and total CPU and memory for each of your Kubernetes nodes. Kubernetes reserves some of each machine’s resources for internal usage, so you cannot allocate 100% of a machine’s stated resources to your system. Now, you can use the allocatable CPU and memory numbers to tune your resource usage with more accuracy.
We identified an improvement for our query planner’s handling of temporal joins! Our logic for finding the most recent observation for a requested timestamp is now more efficient. Happy time traveling!
We now support DynamoDB as a native accelerated data source! After connecting your AWS credentials, Chalk automatically has access to your DynamoDB instance, which you can query with PartiQL.
Underscore expressions on windowed features can now include the special expression _.chalk_window to reference the target window duration. Use _.chalk_window in windowed aggregation expressions to define aggregations across multiple window sizes at once:
@features
class Transaction:
id: int
user_id: "User.id"
amount: float
@features
class User:
id: int
transactions: DataFrame[Transaction]
total_spend: Windowed[float] = windowed(
"30d", "60d", "90d",
default=0,
expression=_.transactions[_.amount, _.ts > _.chalk_window].sum(),
materialization={"bucket_duration": "1d"},
)
offline_query now supports the resources parameter. resources allows you to override the default resource requests associated with offline queries and cron jobs so that you can control CPU, memory, ephemeral volume size, and ephemeral storage.
Datasets and DatasetRevisions have two new methods: preview and summary. preview shows the first few rows of the query output. summary shows summary statistics of the query output. Here’s an example of summary output:
describe user.id ... __index__ shard_id batch_id
0 count 1.0 ... 1.0 0 0
1 null_count 0.0 ... 0.0 0 0
2 mean 1.0 ... 0.0 0 0
3 std 0.0 ... 0.0 0 0
4 min 1.0 ... 0.0 0 0
5 max 1.0 ... 0.0 0 0
6 median 1.0 ... 0.0 0 0
[7 rows x 14 columns]
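A quick sketch of using the new methods on a dataset (the feature names are hypothetical):
from chalk.client import ChalkClient

dataset = ChalkClient().offline_query(
    input={"user.id": [1, 2, 3]},
    output=["user.email"],
)
print(dataset.preview())  # first few rows of the query output
print(dataset.summary())  # summary statistics, like the table above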
Chalk resource groups create separate independent deployments of the query server to prevent resource contention. For example, one team may want to make long-running analytics queries and another may want to make low-latency queries in line with customer requests.
We have updated the Cloud Resource Configuration page! You can now configure resource groups to use completely independent node pools to ensure your workflows run on separate computer hardware. The configuration page also allows you to specify exactly what kind of hardware will be available in each resource group so you can optimize the balance between cost and performance.
This feature is currently available for customers running Chalk in EKS, but will be available soon for customers using GKE.
count() operations are now supported as native DataFrame operations.
You can now view and filter features in the feature catalog by their tags and owners.
We shipped a gRPC engine for Chalk that improved performance by at least 2x through improved data serialization, efficient data transfer, and a migration to our C++ server. You can now use ChalkGRPCClient to run queries with the gRPC engine and fetch enriched metadata about your feature classes and resolvers through the get_graph method.
With ChalkPy v2.38.8 or later, you can now pass spine_sql_query to offline queries. The resulting rows of the SQL query will be used as input to the offline query. Chalk will compute an efficient query plan to retrieve your SQL data without requiring you to load the data and transform it into input before sending it back to Chalk. For more details, check out our documentation.
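As a rough sketch (the SQL text, table, and output features are hypothetical), the result rows of the spine query become the input rows of the offline query:
from chalk.client import ChalkClient

dataset = ChalkClient().offline_query(
    spine_sql_query="""
        select id as "user.id"
        from users
        where signup_date > '2024-01-01'
    """,
    output=["user.email", "user.fraud_score"],
)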
We shipped static planning of underscore expressions. Underscore expressions enable you to define and resolve features from operations on other features. When you use underscore expressions, we now do static analysis of your feature definition to transform it into performant C++ code.
Underscore expressions currently support basic arithmetic and logical operations, and we continue to build out more functionality! See the code snippet below for some examples of how to use underscore expressions:
@features
class SampleFeatureSet:
id: int
feature_1: int
feature_2: int
feature_1_2_sum: int = _.feature_1 + _.feature_2
feature_1_2_diff: int = _.feature_1 - _.feature_2
feature_1_2_equality: bool = _.feature_1 == _.feature_2
You can now add tags to your deployments. Tags must be unique to each of your environments. If you add an already existing tag to a new deployment, Chalk will remove the tag from your old deployment.
Tags can be added with the --deployment-tag flag in the Chalk CLI:
chalk apply --deployment-tag=latest --deployment-tag=v1.0.4
We updated our UI for resource configuration management in the dashboard! You can now toggle your view between a GUI or a JSON editor. The GUI exposes all the configuration options available in the JSON editor, including values that aren’t set, and allows you to easily adjust your cluster’s resources to fit your needs.
We added integrations for Trino and Spanner as new data sources. We’ve also added native drivers for Postgres and Spanner, which drastically improve performance for these data sources.
We now have heartbeating to poll the status of long-running queries and resolvers; hanging runs that are no longer detected will be marked as “failed” after a certain period of time.
We expanded the functionality of our service tokens to enable role-based access control (RBAC) at both the data source and feature level. On the datasource level, you can now restrict a token to only access data sources with matching tags to resolve features. On the feature level, you can restrict a token’s access to tagged features either by blocking the token from returning tagged features in any queries but allowing the feature values to be used in the computation of other features, or by blocking the token from accessing tagged features entirely.
We shipped statuses for incremental runs so that users can see the current high-water mark of the data being updated.
chalk incremental status --scheduled_query get_some_data__daily
✓ Fetched resolver progress state
Resolver: N/A
Query: run_this_query_daily
Environment: chalk12345
Max Ingested Timestamp: 2024-07-01T16:01:46+00:00
Last Execution Timestamp: 2024-07-01T00:01:27.421873+00:00
Chalk now supports executing an offline_query on a schedule. Effectively, this extends the existing “scheduled resolver” functionality and allows you to execute more complicated data ingestion or caching workflows without needing to use Airflow or other external schedulers to orchestrate resolver execution.
Here’s an example of a scheduled query that caches the number of transactions a user has made in the last 24 hours into the online store:
from chalk import ScheduledQuery
ScheduledQuery(
name="num_transactions_last_24h",
output=[User.num_transactions_last_24h],
schedule="0 0 * * *", # every day at midnight
store_online=True, # store the result in the online store
store_offline=False, # don't store this value in the offline store
)
offline_query(...) now accepts sample_features: list[Feature] as an argument. This works in conjunction with recompute_features, and allows you to write something like:
ChalkClient().offline_query(
input={User.id: [...]},
output=[User.full_name],
recompute_features=True, # means "recompute all features"
sample_features=[User.first_name, User.last_name] # but sample these features from the offline store
)
This is useful when you have a large number of features that you want to recompute, but only a few that you want to sample.
ChalkClient.offline_query now accepts run_asynchronously: bool to explicitly opt a query into running on an isolated worker.
DataSet.to_polars()/.to_pandas() now accept output_ts: str and output_id: str to customize the name of the timestamp and id columns in the output dataframe.
chalk apply is roughly twice as fast as of chalkpy v2.33.9.
ChalkClient.query now supports request_timeout: float, which is passed to the underlying requests.request call.
A persistent issue with chalk drop has been resolved. Now, chalk drop will allow you to reset a feature whose deletion has been deployed to the active deployment, which will allow you to re-deploy the feature. Previously, it was possible to get into a state that was impossible to recover from without support.
tags(...) allows you to extract the tags of a @features class or a property (Feature) of that class.
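A small sketch, assuming tags is importable from chalk.features and the feature class is hypothetical:
from chalk import feature
from chalk.features import features, tags

@features(tags="group:risk")
class User:
    id: str
    email: str = feature(tags=["pii"])

tags(User)        # tags set on the feature class
tags(User.email)  # tags set on the individual feature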
DataSet.to_polars()/to_pandas() now raises an error if the dataset computation had errors. This prevents the user from accidentally using a dataset that was not computed correctly. If you wish to use the dataset anyway, you can use DataSet.to_polars(ignore_errors=True).
You can now specify a custom SQL sampling query for offline queries. This allows you to use a native SQL query to compute the query’s entity spine for offline queries. This is useful when you have a complicated sampling policy (i.e. class-based sampling). Additional non-primary key features can be provided as well.
You can now specify required_resolver_tags when querying. This allows you to ensure that a query only considers a resolver if it has a certain tag. This is useful for guaranteeing that a query only uses resolvers that are cost-efficient, or for enforcing certain compliance workflows.
In this example:
@offline()
def fetch_credit_scores() -> DataFrame[User.id, User.credit_score]:
"""
Call bureaus to get credit scores; costs money for each record retrieved.
"""
return requests.post(...)
@offline(tags=["low-cost"])
def fetch_previously_ingested_credit_scores() -> DataFrame[User.id, User.credit_score]:
"""
Pull previously retrieved credit scores from Snowflake only
"""
return snowflake.query_string("select user_id as id, credit_score from ...").all()
querying with required_resolver_tags can be used to enforce that only ‘low-cost’ resolvers are executed:
# This query is guaranteed to /never/ run any resolver that isn't tagged "low-cost".
dataset = ChalkClient().offline_query(
input={User.id:[1,2,3]},
output=[
User.credit_score
],
recompute_features=True,
required_resolver_tags=["low-cost"]
)
You can now use either Python 3.11 or 3.10 on a per-environment basis.
project: my-project-id
environments:
default:
runtime: 3.10
develop:
runtime: 3.11
See Python Version for more information.
ChalkClient.query_bulk(...) and multi_query no longer require that referenced features be defined as Python classes; string names for inputs and outputs can now be used instead.
Alerts now support descriptions, which can be used to provide more context about the alert.
from chalk.monitoring import Chart, Series
Chart(name="Request count").with_trigger(
Series
.feature_null_ratio_metric()
.where(feature=User.fico_score) > 0.2,
description="""*Debugging*
When this alert is triggered, we're parsing null values from
a lot of our FICO reports. It's likely that Experian is
having an outage. Check the <dashboard|https://internal.dashboard.com>.
"""
)
These descriptions can also be set in the Chalk dashboard via the metric alerts interface.
The query_bulk method is now available in the ChalkClient class. This method allows you to query for multiple rows of features at once.
This method uses Apache Arrow’s Feather format to encode data. This allows the endpoint to transmit data (particularly numeric-heavy data) using roughly 1/10th the bandwidth that is required for the JSON format used by query.
This method has been available in beta for a few months, but is now available for general use, and as part of this release is now supported when querying using notebooks without access to feature schemas.
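A minimal sketch of a bulk query (the feature names are hypothetical):
from chalk.client import ChalkClient

client = ChalkClient()
# One list per input feature; each index corresponds to one row of output.
result = client.query_bulk(
    input={"user.id": ["u_1", "u_2", "u_3"]},
    output=["user.email", "user.fraud_score"],
)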
The list of scheduled resolvers now shows which resolvers are actually scheduled to run in the current environment, based on the environment argument to @online and @offline. Resolvers that are annotated with an environment other than the current environment are labeled with the environment in which they are configured to run.
The chalk query command now has improved output for errors. Previously, errors were displayed in a table, which meant that stacktraces were truncated:
> chalk query --in email.normalized=nice@chalk.ai --out email
Errors
Code Feature Resolver Message
─────────────────────────────────────────────────────────────────────────────
RESOLVER_FAILED src.resolvers.get_fraud_tags KeyError: 'tags'
Now, errors are displayed in a more readable format, and stacktraces are not truncated:
> chalk query --in email.normalized=nice@chalk.ai --out email
Errors
Resolver Failed src.resolvers.get_fraud_tags
KeyError: 'tags'
File "src/resolvers.py", line 30, in get_fraud_tags
return parsed["tags"]
KeyError('tags')
The query plan viewer now includes a flame graph visualization of the query plan’s execution, called the Trace View. Precise trace data is stored for every offline query by default, and for online queries when the query is made with the --explain flag.
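For example, an online query can be traced by adding the flag to the CLI invocation (the feature names here are hypothetical):
chalk query --in user.id=123 --out user.fraud_score --explain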
Added support for now= for .query, --now, etc.
offline_query now supports running downstream resolvers when no input is provided. Query primary keys will be sampled or computed, depending on the value of recompute_features.
online_query now supports running a query without any input. Query primary keys will be computed using an appropriate no-argument resolver that returns a DataFrame[...].
Added --local for chalk query, which combines chalk apply --branch and chalk query --branch.
The chalk command line tool is no longer an off-brand magenta.
Added: .to_polars(), to_pandas(), and .to_pyarrow() accept prefixed: bool as an argument. prefixed=True is the default behavior, and will prefix all column names with the feature namespace. prefixed=False will not prefix column names.
DataFrame({User.name: ["Andy"]}).to_polars(prefixed=False)
# output:
# polars DataFrame with `name` as the sole column.
DataFrame({User.name: ["Andy"]}).to_polars(prefixed=True)
# output:
# polars DataFrame with `user.name` as the sole column.
Added: include_meta on ChalkClient.query(...), which includes .meta on the response object. This metadata object includes useful information about the query execution, at the cost of increased network payload size and a small increase in latency.
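A short sketch of the new parameter (the feature names are hypothetical):
from chalk.client import ChalkClient

response = ChalkClient().query(
    input={"user.id": "u_123"},
    output=["user.email"],
    include_meta=True,
)
# .meta carries metadata about how the query executed.
print(response.meta)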
Chalk now supports freezing time in unit tests. This is useful for testing time-dependent resolvers.
from datetime import timezone, datetime
from chalk.features import DataFrame, after
from chalk.features.filter import freeze_time
df = DataFrame([...])
with freeze_time(at=datetime(2020, 2, 3, tzinfo=timezone.utc)):
df[after(days_ago=1)] # Get items after february 2nd
freeze_time also works with resolvers that declare specific time bounds for their aggregation inputs:
@online
def get_num_transactions(txs: Card.transactions[before(days_ago=1)]) -> Card.num_txs:
return len(txs)
with freeze_time(at=datetime(2020, 9, 14)):
num_txs = get_num_transactions(txs) # num transactions before september 13th
Chalk now supports resolvers that are explicitly time-dependent. This is useful for performing backfills which compute values that depend on values that are semantically similar to datetime.now().
You can express time-dependency by declaring a dependency on a special feature called Now:
@online
def get_age_in_years(birthday: User.birthday, now: Now) -> User.age_in_years:
return (now - birthday).days // 365
In online query (i.e. with ChalkClient().query), Now is datetime.now(). In offline query contexts, now will be set to the appropriate input_time value for the calculation. This allows you to backfill a feature for a single entity at many different historical time points:
ChalkClient().offline_query(input={User.id: [1,1,1]}, output=[User.age_in_years], input_times=[
datetime.now() - timedelta(days=100),
datetime.now() - timedelta(days=50),
datetime.now() - timedelta(days=0),
])
...
Now can be used in batch resolvers as well:
@online
def batch_get_age_in_years(df: DataFrame[User.id, User.birthday, Now]) -> DataFrame[User.id, User.age_in_years]:
...
SQL file resolvers are Chalk’s preferred method of resolving features with SQL queries. Now, you can get your SQL file resolvers in Python by the name of the SQL file resolver. For example, if you have the following SQL file resolver:
-- source: postgres
-- cron: 1h
-- resolves: Person
select id, name, email, building_id from table where id=${person.id}
you can test out your resolver with the following code.
from chalk import get_resolver
resolver = get_resolver('example') # get_resolver('example.chalk.sql') will also work
result = resolver('my_id')
Now, Chalk supports exporting metrics about “named query” execution. These metrics (count, latency) join similar metrics about feature and resolver execution. Contact your Chalk Support representative to configure metrics export if you would like to view metrics about Chalk system execution in your existing metrics dashboards.
Additional updates:
Chalk Branch Deployments provide an excellent experience for quick iteration cycles on new features and resolvers. Now, Chalk Branch Deployments automatically use a pool of “standby” workers, so there is less delay before queries can be served against a new deployment. This reduces the time it takes to run query or offline query against a new deployment from ~10-15 seconds to ~1-3 seconds. This impacts customers with more complex feature graphs the most.
Stream resolvers support a keys= parameter. This parameter allows you to re-key a stream by a property of the message, rather than relying on the protocol-layer key. This is appropriate if a stream is keyed randomly, or by an entity key like “user”, but you want to aggregate along a different axis, e.g. “organization”.
Now, keys= supports passing a “dotted string” (e.g. foo.bar) to indicate that Chalk should use a sub-field of your message model. Previously, only root-level fields of the model were supported.
If you specify projections or filters in DataFrame arguments of resolvers, Chalk will automatically project out columns and filter rows in the input data.
Below, we test a resolver that filters rooms in a house to only the bedrooms:
@features
class Room:
id: str
home_id: str  # needed for the Home.rooms join below
name: str
@features
class Home:
id: str
rooms: DataFrame[Room] = has_many(
lambda: Room.home_id == Home.id
)
num_bedrooms: int
@online
def get_num_bedrooms(
rooms: Home.rooms[Room.name == 'bedroom']
) -> Home.num_bedrooms:
return len(rooms)
Now, we may want to write a unit test for this resolver.
def test_get_num_rooms():
# Rooms is automatically converted to a `DataFrame`
rooms = [
Room(id=1, name="bedroom"),
Room(id=2, name="kitchen"),
Room(id=3, name="bedroom"),
]
# The kitchen room is filtered out
assert get_num_bedrooms(rooms) == 2
# `get_num_bedrooms` also works with a `DataFrame`
assert get_num_bedrooms(DataFrame(rooms)) == 2
While we could have written this test before, we would have had to manually filter the input data to only include bedrooms. Also note that Chalk will automatically convert our argument to a DataFrame if it is not already one.
Chalk’s dashboard shows aggregated logs and metrics about the execution of queries and resolvers. Now, it can also show detailed metrics for a single query. This is useful for debugging and performance tuning.
You can access this page from the “runs” tab on an individual named query page, or from the “all query runs” link on the “queries” page.
You can search the list of previously executed queries by date range, or by “query id”. The query id is returned in the “online query” API response object.
Chalk now supports BigTable as an online-storage implementation. BigTable is appropriate for customers with large working sets of online features, as is common with recommendation systems. We have successfully configured BigTable to serve 700,000 feature vectors per second at ~30ms p90 e2e latency.
The Offline Query API has been enhanced with a new recompute_features parameter. Users can control which features are sampled from the offline store, and which features are recomputed.
False will maintain current behavior, returning only samples from the offline store. True will ignore the offline store, and execute @online and @offline resolvers to produce the requested output. When a list of features is passed to recompute_features, those features will be recomputed by running @online and @offline resolvers, and all other feature values (including those needed to recompute the requested features) will be sampled from the offline store.
The ‘recompute’ capability is also exposed on Dataset. When passed a list of features to recompute, a new Dataset Revision will be generated, and the existing dataset will be used as inputs to recompute the requested features.
Chalk has introduced a new workflow when working with branches, allowing full iterations to take place directly in any IPython notebook. When a user creates a Chalk Client with a branch in a notebook, subsequent features and resolvers in the notebook will be deployed to that branch. When combined with Recompute Dataset and the enhancements to Offline Query, users have a new development loop available for feature exploration and development:
Deployments now offer the ability to view their source code. By clicking the “View Source” button on the Deployment Detail page, users can view all files included in the deployed code.
Users can now “redeploy” any historical deployment with a UI button on the deployment details page. This enables useful workflows including rollbacks. The “download source” button downloads a tarball containing the deployed source to your local machine.
When writing resolvers, incorrect typing can be difficult to track down. Now, if a resolver instantiates a feature of an incorrect type, the resolver error message will include the primary key value(s) of the query itself.
The Online Query API can now be used to query DataFrame-typed features. For instance, you can query all of a user’s transaction level features in a single query:
chalk query --in user.id --out user.transactions
{
"columns": ["transaction.id", "transaction.user_id", ...],
"values": [[1, 2, 3, ...], ["user_1", "user_2", "user_3", ...]]
}
More functionality will be added to Online and Offline query APIs to support more advanced query patterns.
When deploying with chalk apply, a new flag --branch <branch_name> has been introduced which creates a branch deployment.
Users can interact with their branch deployment using a consistent name by passing the branch name to query, upload_features, etc.
Chalk clients can also be scoped to a branch by passing the branch in the constructor.
Branch deployments are many times faster than other flavors of chalk apply, frequently taking only a few seconds from beginning to end.
Branch deployments replace preview deploys, which have been deprecated.
Deployments via chalk apply are now up to 50% faster in certain cases. If your project’s PIP dependencies haven’t changed, new deployments will build and become active significantly faster than before.
Introduces a new “offline_ttl” property to the features decorator. Now you can control how long data is valid in the offline store. Any feature older than the TTL value will not be returned in an offline query.
@features
class MaxOfflineTTLFeatures:
id: int
ts: datetime = feature_time()
no_offline_ttl_feature: int = feature(offline_ttl=timedelta(0))
one_day_offline_ttl_feature: int = feature(offline_ttl=timedelta(days=1))
infinite_ttl_feature: int
Adds the strict property to the features decorator, indicating that any failed validation will throw an error. Invalid features will never be written to the online or offline store if strict is True. Also introduces the validations array to allow differentiated strict and soft validations on the same feature.
@features
class ClassWithValidations:
id: int
name: int = feature(max=100, min=0, strict=True)
feature_with_two_validations: int = feature(
validations=[
Validation(min=70, max=100),
Validation(min=0, max=100, strict=True),
]
)
The Dataset class is now live! Using the new ChalkClient.offline_query method, we can inspect important metadata about the query and retrieve its output data in a variety of ways. Simply attach a dataset_name to the query to persist the results.
from datetime import datetime
import pandas as pd
from chalk.client import ChalkClient, Dataset
uids = [1, 2, 3, 4]
at = datetime.now()
dataset: Dataset = ChalkClient().offline_query(
input={
User.id: uids,
},
input_times=[at] * len(uids),
output=[
User.id,
User.fullname,
User.email,
User.name_email_match_score,
],
dataset_name='my_dataset'
)
pandas_df: pd.DataFrame = dataset.data_as_pandas
Check out the documentation here.
Chalk now provides access to build and boot logs through the Deployments page in the dashboard.
Computing features associated with third-party services can be unpredictably slow. Chalk helps you manage such uncertainty by specifying a resolver timeout duration.
Now you can set timeouts for resolvers!
@online(timeout="200ms")
def resolve_australian_credit_score(driver_id: User.driver_id_aus) -> User.credit_score_aus:
return experian_client.get_score(driver_id)
SQL-integrated resolvers can be completely written in SQL files: no Python required! If you have a SQL source like the following:
pg = PostgreSQLSource(name='PG')
You can define a resolver in a .chalk.sql file, with comments that detail important metadata. Chalk will process it upon chalk apply as it would any other Python resolver.
-- type: online
-- resolves: user
-- source: PG
-- count: 1
select email, full_name from user_table where id=${user.id}
Check out the documentation here.
Logging on your dashboard has been improved. You can now scroll through more logs, and the formatting is cleaner and easier to use. This view is available for resolvers and resolver runs.
Online Query Response objects now support pretty-print in any iPython environment.
chalkpy has always supported running in docker images using M1’s native arm64 architecture, and now chalkpy==1.12.0 supports most functionality on M1 Macs when run with AMD64 (64-bit Linux) architecture docker images. This is helpful when testing images built for Linux servers that include chalkpy.
Chalk has lots of documentation, and finding content could be difficult. We’ve added docs search! Try it out by typing cmd-K, or clicking the search button at the top of the table of contents.
This update makes several improvements to feature discovery.
Tags and owners are now parsed from the comments preceding the feature definition.
@features
class RocketShip:
# :tags: team:identity, priority:high
# :owner: katherine.johnson@nasa.gov
velocity: float
...
Prior to this update, owners and tags needed to be set in the feature(...) function:
@features
class RocketShip:
velocity: float = feature(
tags=["team:identity", "priority:high"],
owner="katherine.johnson@nasa.gov"
)
...
Feel free to choose either mechanism!
It’s natural to name the primary feature of a feature class id. So why do you always have to specify it?
Until now, you needed to write:
@features
class User:
id: str = feature(primary=True)
...
Now you don’t have to! If you have a feature class that does not have a feature with the primary field set, but has a feature called id, it will be assigned primary automatically:
@features
class User:
id: str
...
The functionality from before sticks around: if you use a field as a primary key with a name other than id, you can keep using it as your primary feature:
@features
class User:
user_id: str = feature(primary=True)
# Not really the primary key!
id: str
The Chalk DataFrame now supports boolean expressions! The Chalk team has worked hard to let you express your DataFrame transformations in natural, idiomatic Python:
DataFrame[
User.first_name == "Eleanor" or (
User.email == "eleanor@whitehouse.gov" and
User.email_status not in {"deactivated", "unverified"}
) and User.birthdate is not None
]
Python experts will note that or, and, is, is not, not in, and not aren’t overload-able. So how did we do this? The answer is AST parsing! A more detailed blog post to follow.
This update makes several improvements to feature discovery. Descriptions are now parsed from the comments preceding the feature definition. For example, we can document the feature User.fraud_score with a comment above the attribute definition:
@features
class User:
# 0 to 100 score indicating an identity match.
# Low scores indicate safer users
fraud_score: float
...
Prior to this update, descriptions needed to be set in the feature(...) function:
@features
class User:
fraud_score: float = feature(description="""
0 to 100 score indicating an identity match.
Low scores indicate safer users
""")
...
The description passed to feature(...) takes precedence over the implicit comment description.
You can now set attributes for all features in a namespace! Here, we assign the tag group:risk and the owner ravi@chalk.ai to all features on the feature class. Owners specified at the feature level take precedence (so the owner of User.email is the default ravi@chalk.ai, whereas the owner of User.flaky_api_result is devops@chalk.ai). Tags aggregate, so email has the tags pii and group:risk.
@features(tags="group:risk", owner="ravi@chalk.ai")
class User:
email: str = feature(tags="pii")
flaky_api_result: str = feature(owner="devops@chalk.ai")
You can configure Chalk to post messages to your Slack workspace! You can find the Slack integration tab in the settings page of your dashboard.
Slack can be used as an alert channel or for build notifications.
Chalk’s pip package now supports Python 3.8! With this change, you can use the Chalk package to run online and offline queries in a Python environment with version >= 3.8. Note that your features will still be computed on a runtime with Python version 3.10.
Chalk injects environment variables to support data integrations. But what happens when you have two data sources of the same kind? Historically, our recommendation was to create one set of environment variables through an official data source integration, and one set of prefixed environment variables yourself using the generic environment variable support.
With the release of named integrations, you can connect to as many data sources of the same kind as you need! Provide a name at the time of configuring your data source, and reference it in the code directly. Named integrations inject environment variables with the standard names prefixed by the integration name (i.e. RISK_PGPORT). The first integration of a given kind will also create the un-prefixed environment variable (i.e. both PGPORT and RISK_PGPORT).
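In code, the integration name is passed to the data source constructor. A sketch, assuming an integration named RISK:
from chalk.sql import PostgreSQLSource

# "RISK" matches the name given when configuring the integration, so this
# source reads the RISK_-prefixed environment variables.
risk_pg = PostgreSQLSource(name="RISK")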
Chalk is excited to announce the availability of our SOC 2 Type 1 report from Prescient Assurance. Chalk has instituted rigorous controls to ensure the security of customer data and earn the trust of our customers, but we’re always looking for more ways to improve our security posture, and to communicate these steps to our customers. This report is one step along our ongoing path of trust and security.
If you’re interested in reviewing this report, please contact support@chalk.ai to request a copy.
You can now convert Chalk’s DataFrame to a pandas.DataFrame and back! Use the methods chalk_df.to_pandas() and .from_pandas(pandas_df).
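A small sketch of the round trip (the column name is illustrative):
import pandas as pd
from chalk.features import DataFrame

pandas_df = pd.DataFrame({"user.id": ["u_1", "u_2"]})
chalk_df = DataFrame.from_pandas(pandas_df)  # pandas -> Chalk DataFrame
round_trip = chalk_df.to_pandas()            # Chalk DataFrame -> pandas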
The 1.4.1 release of the CLI added a parameter --sample to chalk migrate. This flag allows migrations to be run targeting specific sample sets.
This flag allows migrations to be run targeting specific sample sets.
Added sparklines to the feature and resolver tables, which show a quick summary of request counts over the past 24 hours. Added a status column to the feature and resolver tables, which shows any failing checks related to a feature or resolver.