# Using Chalk expressions to define features
Chalk Expressions let you define features declaratively, using symbolic computation over your data. While you write expressions in idiomatic Python, they are compiled and executed as vectorized C++, enabling low-latency computation at serve time and high-throughput processing at train time.
Expressions support a wide range of operations, including arithmetic, filtering, aggregations, and built-in functions from the `chalk.functions` module, like `F.jaccard_similarity`.
For example, in a `Transaction` feature class, we can compute the subtotal of a transaction as the difference between its total and sales tax:
```python
from chalk import _
from chalk.features import features


@features
class Transaction:
    id: int
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax
```
The `_` symbol refers to the current scope (here, the feature class `Transaction`) and is used to reference other fields on the same instance. Expressions like `_.total - _.sales_tax` are compiled into native execution plans that run efficiently in production.
In addition to referencing fields on the same object, you can traverse relationships. If each `Transaction` is associated with a `User`, for example, you can compute a string similarity between the user's name and the transaction memo:
```python
import chalk.functions as F
from chalk import _
from chalk.features import features


@features
class Transaction:
    ...
    memo: str
    user: "User"
    name_match_score: float = F.jaccard_similarity(
        _.user.name, _.memo
    )
```
Here, `_.user.name` follows the foreign key relationship from `Transaction` to `User`. The function `F.jaccard_similarity` is one of many built-in Chalk functions that operate on symbolic expressions.
Expressions can also perform aggregations over related records. In a `User` feature class, we can compute aggregates like the number of large transactions or the total amount spent:
```python
from chalk import _
from chalk.features import DataFrame, features


@features
class User:
    id: int
    name: str
    transactions: DataFrame[Transaction]
    num_large_txns: int = _.transactions[_.total > 1000].count()
    total_spend: float = _.transactions[_.total].sum()
```
In this context, `_` refers to the `User` instance when referring to `_.transactions`. But when you apply a filter, like `_.transactions[_.total > 1000]`, the expression inside the brackets is evaluated in the context of each individual `Transaction`. That means `_.total` refers to the total field on each `Transaction`, not on the `User`. This scoped evaluation makes it easy to filter, project, and aggregate over related data.
All expressions are statically analyzed, optimized to eliminate redundant computation, and executed as high-performance C++ at runtime.
Chalk expressions support a wide range of built-in functions for manipulating data, performing calculations, and transforming features. These functions can be used in expressions to operate on feature values, DataFrames, and other data types.
For example, defining `subtotal` with an expression, as above, is equivalent to computing it with the following Python resolver:

```python
from chalk import online
from chalk.features import _, features


@features
class Transaction:
    id: int
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax


@online
def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
    return total - sales_tax
```
Chalk expressions support a variety of infix operators for arithmetic, comparisons, and boolean logic.
| Operator | Description | Example |
|---|---|---|
| `+` | Addition | `_.total + _.sales_tax` |
| `-` | Subtraction | `_.total - _.sales_tax` |
| `*` | Multiplication | `_.quantity * _.price` |
| `/` | Division | `_.total / _.quantity` |
| `>` | Greater than | `_.total > 1000` |
| `>=` | Greater than or equal | `_.total >= 1000` |
| `<` | Less than | `_.total < 1000` |
| `<=` | Less than or equal | `_.total <= 1000` |
| `==` | Equal | `_.status == "completed"` |
| `!=` | Not equal | `_.status != "completed"` |
| `&` | Boolean and | `_.is_active & _.is_verified` |
| `\|` | Boolean or | `_.is_active \| _.is_verified` |
| `~` | Boolean not | `~_.is_active` |
**Don't use `or`, `and`, `not`, or `is` in expressions.** Python does not allow these keywords to be overloaded, so they cannot be captured symbolically; use the `&`, `|`, and `~` operators instead.
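Note also that `&` and `|` bind more tightly than comparison operators in Python, so comparisons must be parenthesized when combined. A minimal sketch, reusing the `total` and `status` fields from the examples above:

```python
from chalk import _
from chalk.features import features


@features
class Transaction:
    id: int
    total: float
    status: str
    # Parentheses are required because `&` binds more tightly than `>` and `==`.
    is_large_completed: bool = (_.total > 1000) & (_.status == "completed")
```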
The `chalk.functions` module exposes several helpful functions that can be used in combination with expression references to transform features. These functions are meant to be used in expressions and are not available as standalone functions. To view all available functions, see our SDK docs.
Expressions can be used to access nested attributes from other features in a feature class, whether these other features are struct dataclasses, features, or DataFrames.
```python
import chalk.functions as F
from chalk import _
from chalk.features import features
from dataclasses import dataclass


@dataclass
class LatLon:
    lat: float | None
    lon: float | None


@features
class User:
    id: int
    home: LatLon
    work: LatLon
    commute_distance: float = F.haversine(
        lat1=_.home.lat,
        lon1=_.home.lon,
        lat2=_.work.lat,
        lon2=_.work.lon,
    )
```
You can create custom functions to encapsulate complex logic or reusable computations in your expressions. For example, if you wanted to apply consistent windows across many features, you could define a custom function like this:
**Use helper functions to create expressions**
```python
from chalk import _
from chalk.features import DataFrame, features


def count_where(*filters):
    return _.transactions[_.created_at > _.chalk_window, *filters].count()


@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    num_large_transactions: int = count_where(_.total > 1000)
    num_small_transactions: int = count_where(_.total < 100)
```
**Don't use expressions in Python resolvers**
```python
import chalk.functions as F
from chalk import online
from chalk.features import features


@features
class User:
    id: int
    name: str
    email: str
    name_email_match_score: float


@online
def get_score(
    name: User.name,
    email: User.email,
) -> User.name_email_match_score:
    # DON'T DO THIS!
    return F.jaccard_similarity(name, email)
```
Expressions are not supported in Python resolvers, so you cannot use Chalk functions like `F.jaccard_similarity` in a Python resolver. Instead, use expressions to define the feature directly in the feature class:
```python
import chalk.functions as F
from chalk import _
from chalk.features import features


@features
class User:
    id: int
    name: str
    email: str
    name_email_match_score: float = F.jaccard_similarity(
        _.name, _.email
    )
```
DataFrame features can be filtered with expressions. Extending our `Transaction` example, we can create a `User` feature class with a has-many relationship to `Transaction`. Then, we can define a feature representing the number of large purchases by referencing the existing `User.transactions` feature:
```python
from chalk.features import _, features, DataFrame


@features
class Transaction:
    id: int
    user_id: "User.id"
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax


@features
class User:
    id: int
    # implicit has-many relationship with Transaction due to `user_id` above
    transactions: DataFrame[Transaction]
    num_large_transactions: int = _.transactions[_.total > 1000].count()
```
The object referenced by `_` changes depending on its current scope. In this code, the `_` in `_.transactions` references the `User` object. Within the `DataFrame` filter, the `_` in `_.total` references each `Transaction` object as each one is evaluated. The `count` aggregation is covered in the next section.
DataFrame features support projection with expressions, which produces a new `DataFrame` scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.
With our `Transaction` example, we already saw a `count` aggregation for counting the number of large transactions. We can add another aggregation for computing the user's total spend:
```python
from chalk.features import _, features, DataFrame


@features
class Transaction:
    id: int
    user_id: "User.id"
    sales_tax: float
    subtotal: float
    total: float = _.subtotal + _.sales_tax


@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    num_large_transactions: int = _.transactions[_.total > 1000].count()
    total_spend: float = _.transactions[_.total].sum()
```
To compute `User.total_spend`, we needed to create a projection of the `User.transactions` `DataFrame` limited to only the `Transaction.total` column so that the `sum` aggregation could work. In contrast, no projection was needed for the `num_large_transactions` aggregation because `count` works on `DataFrames` with any number of columns.
**Use materialized aggregations**

For computing low-latency aggregations over high volumes of data, Chalk also offers materialized windowed aggregations, which materialize buckets of data to compute large aggregations efficiently.
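A sketch of what this can look like, using `windowed` with a `materialization` bucket configuration (the window sizes and bucket duration here are illustrative; see the materialized aggregations documentation for the full API):

```python
from chalk import _
from chalk.features import DataFrame, Windowed, features, windowed


@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    # Rolling spend over 1-day and 30-day windows, backed by
    # materialized 1-day buckets (illustrative configuration).
    total_spend: Windowed[float] = windowed(
        "1d", "30d",
        materialization={"bucket_duration": "1d"},
        expression=_.transactions[_.total].sum(),
    )
```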
Aggregation functions have varying behavior when handling `None` values and empty `DataFrames`. If an aggregation function says `None` values are skipped in the table below, it will consider a `DataFrame` with only `None` values as empty.
| Function | `None` values | Empty DataFrame | Notes |
|---|---|---|---|
| `sum` | Skipped | Returns `0` | |
| `min` | Skipped | Returns `None` | |
| `max` | Skipped | Returns `None` | |
| `mean` | Skipped | Returns `None` | Feature type must be `float` or `float \| None`. `None` values are skipped, meaning they are not included in the mean calculation. |
| `count` | Included | Returns `0` | |
| `any` | Skipped | Returns `False` | |
| `all` | Skipped | Returns `True` | |
| `std` | Skipped | See notes | Standard deviation. Requires at least 2 values. For DataFrames with fewer than 2 values, returns `None`. Aliases: `stddev`, `stddev_sample`, `std_sample`. |
| `var` | Skipped | See notes | Variance. Same requirements as `std`. Alias: `var_sample`. |
| `approx_count_distinct` | Skipped | Returns `0` | |
| `approx_percentile` | Skipped | Returns `None` | Takes one argument, `quantile`, expected to be a `float` in the range `[0, 1]`. Example: `approx_percentile(0.75)` returns a value approximately equal to the 75th percentile of the not-`None` values in the DataFrame. |
| `approx_top_k` | Skipped | Returns `None` | Takes one argument, `k`, expected to be a positive integer, as a keyword argument. Example: `approx_top_k(k=25)` returns the 25 or fewer approximately most common values. |
For aggregations that can return `None`, either mark the feature as optional (for example, by setting the feature type to `float | None`) or use `coalesce` to fall back to a default value.
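For example, a minimal sketch using the `Transaction` model above, with `F.coalesce` supplying the fallback:

```python
import chalk.functions as F
from chalk import _
from chalk.features import DataFrame, features


@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    # `max` returns None for an empty DataFrame, so mark the feature optional...
    largest_txn: float | None = _.transactions[_.total].max()
    # ...or coalesce the aggregation to a default so the type can stay required.
    largest_txn_or_zero: float = F.coalesce(_.transactions[_.total].max(), 0.0)
```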
To specify run conditions such as environment, tags, and versions for a feature that is resolved through an expression, you can use the `feature` function and pass in the expression as an argument:
```python
from chalk import _
from chalk.features import DataFrame, feature, features


@features
class User:
    id: int
    purchases: DataFrame[Purchase]
    # Uses a default value of 0 when one cannot be computed.
    num_purchases: int = feature(
        expression=_.purchases.count(),
        default=0,
        environment=["staging", "dev"],
        tags=["fraud", "credit"],
        version=1,
    )
```
To test your expressions, we recommend setting up integration tests or iterating on a branch.
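For instance, an integration test against a branch deployment might look like this sketch (the branch name, query input, and asserted features are illustrative):

```python
from chalk.client import ChalkClient

# Target a branch deployment to iterate on expression changes before
# promoting them ("expressions-test" is a hypothetical branch name).
client = ChalkClient(branch="expressions-test")

result = client.query(
    input={"user.id": 1},
    output=["user.num_large_transactions", "user.total_spend"],
)
assert result.get_feature_value("user.num_large_transactions") is not None
```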
In some cases, you may want to build expressions dynamically. For example, if you have a rules engine that generates expressions and stores them in a database, you can load those expressions at runtime and ask Chalk to compute their values.
Consider this `User` and `Transaction` model, which would be checked in to the code:
```python
from chalk.features import features, DataFrame


@features
class Transaction:
    id: int
    user_id: "User.id"
    user: "User"
    total: float
    sales_tax: float


@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    name: str
    email: str
```
Using the Golang or Python SDKs, you can compute both scalar and aggregate features on demand.
For example, in Golang, if we wanted to compare the lowercased name and email by their Jaccard similarity, we could build the expression dynamically and ask Chalk to compute it:
```go
package main

import (
	"testing"

	"github.com/zeebo/assert"

	"github.com/chalk-ai/chalk-go"
	"github.com/chalk-ai/chalk-go/expr"
)

func TestQueryingExpressions(t *testing.T) {
	// Picks up the ambient credentials from the `chalk login` run on the CLI.
	client, err := chalk.NewGRPCClient(t.Context())
	assert.NoError(t, err)

	result, err := client.OnlineQueryBulk(
		t.Context(),
		chalk.OnlineQueryParams{}.
			WithInput("user.id", []int{1}).
			WithOutputs("user.name").
			WithOutputs("user.email").
			WithOutputExprs(
				expr.FunctionCall(
					"jaccard_similarity",
					expr.FunctionCall("lower", expr.Col("name")),
					expr.FunctionCall("lower", expr.Col("email")),
				).
					As("user.name_email_sim"),
			),
	)
	assert.NoError(t, err)

	row, err := result.GetRow(0)
	assert.NoError(t, err)
	for feature, value := range row.Features {
		t.Logf("Feature: %s, Value: %+v", feature, value.Value)
	}
}
```
This program produces output like this:
```
=== RUN   TestChalkClient
    chalk_test.go:46: Feature: user.name, Value: Nicole Mann
    chalk_test.go:46: Feature: user.email, Value: nicoleam@nasa.gov
    chalk_test.go:46: Feature: user.name_email_sim, Value: 0.5714285714285714
--- PASS: TestChalkClient (0.43s)
```
The feature `user.name_email_sim` was computed using the expression `jaccard_similarity(lower(name), lower(email))`, which was built dynamically using the `expr` package. For a complete list of functions, see our SDK docs. All the functions available in expressions are also available in the `expr` package.
You can also compute aggregations dynamically. Using the Golang SDK, we can compute the number of large transactions for a user:
```go
result, err := client.OnlineQueryBulk(
	t.Context(),
	chalk.OnlineQueryParams{}.
		WithInput("user.id", []int{1}).
		WithOutputExprs(
			expr.DataFrame("transactions").
				Filter(expr.Col("amount").Gt(expr.Float(0.))).
				Agg("count").
				As("user.positive_transaction_count"),
		),
)
```
For this program, the output would look like this:
```
=== RUN   TestChalkClient
    chalk_test.go:42: Feature: user.positive_transaction_count, Value: 33
--- PASS: TestChalkClient (0.45s)
```
In a Python expression, the above SDK call is equivalent to:
```python
User.positive_transaction_count = _.transactions[_.amount > 0].count()
```
Both scalar functions and aggregations can also be computed with the Python SDK. See this guide for more details.