Resolvers
Using Chalk expressions to define features
Chalk Expressions let you define features declaratively, using symbolic computation over your data. While you write expressions in idiomatic Python, they are compiled and executed as vectorized C++, enabling low-latency computation at serve time and high-throughput processing at train time.
Expressions support a wide range of operations, including arithmetic, filtering, aggregations, and built-in functions like .
For example, in a Transaction feature class, we can compute the subtotal of a transaction as the difference between its total and sales tax:
from chalk.features import features, Primary
from chalk import _
@features
class Transaction:
id: int
total: float
sales_tax: float
subtotal: float = _.total - _.sales_tax
The _
symbol refers to the current scope (here, the feature class Transaction
)
and is used to reference other fields on the same instance.
Expressions like _.total - _.sales_tax
are compiled into native execution
plans that run efficiently in production.
In addition to referencing fields on the same object, you can traverse relationships.
If each Transaction
is associated with a User
, for example, you can compute a
string similarity between the user’s name and the transaction memo:
import chalk.functions as F
@features
class Transaction:
...
memo: str
user: "User"
name_match_score: float = F.jaccard_similarity(
_.user.name, _.memo
)
Here, _.user.name
follows the foreign key relationship from Transaction
to User
.
The function F.jaccard_similarity
is one of many built-in Chalk functions that
operate on symbolic expressions.
Expressions can also perform aggregations over related records.
In a User
feature class, we can compute aggregates like the
number of large transactions or the total amount spent:
from chalk import _
from chalk.features import DataFrame, features
@features
class User:
id: int
name: str
transactions: DataFrame[Transaction]
num_large_txns: int = _.transactions[_.total > 1000].count()
total_spend: float = _.transactions[_.total].sum()
In this context, _
refers to the User
instance when referring to _.transactions
.
But when you apply a filter, like _.transactions[_.total > 1000]
,
the expression inside the brackets is evaluated in the context of each individual Transaction
.
That means _.total
refers to the total field on each Transaction
, not on the User
.
This scoped evaluation makes it easy to filter, project, and aggregate over related data.
All expressions are statically analyzed, optimized to eliminate redundant computation, and executed as high-performance C++ at runtime.
Chalk expressions support a wide range of built-in functions for manipulating data, performing calculations, and transforming features. These functions can be used in expressions to operate on feature values, DataFrames, and other data types.
from chalk import features, online
from chalk.features import _, features
@features
class Transaction:
id: int
total: float
sales_tax: float
subtotal: float = _.total - _.sales_tax
@online
def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
return total - sales_tax
Chalk expressions support a variety of infix operators for arithmetic, conditions, and boolean logic.
Operator | Description | Example |
---|---|---|
+ | Addition | _.total + _.sales_tax |
- | Subtraction | _.total - _.sales_tax |
* | Multiplication | _.quantity * _.price |
/ | Division | _.total / _.quantity |
> | Greater than | _.total > 1000 |
>= | Greater than or equal | _.total >= 1000 |
< | Less than | _.total < 1000 |
<= | Less than or equal | _.total <= 1000 |
== | Equal | _.status == "completed" |
!= | Not equal | _.status != "completed" |
& | Boolean and | _.is_active & _.is_verified |
| | Boolean or | _.is_active | _.is_verified |
~ | Boolean not | ~_.is_active |
Don't use `or`, `and`, `not`, and `is` in expressions
The chalk.functions
module exposes several helpful functions that can be used in
combination with expression references to transform features. These functions are meant
to be used in expressions and are not available as standalone functions. To view all
available functions, see our SDK docs
Expressions can be used to access nested attributes from other features in a feature class, whether these
other features are struct dataclasses
, features
, or DataFrames
.
import chalk.functions as F
from chalk.features import features
from dataclasses import dataclass
@dataclass
class LatLon:
lat: float | None
lon: int | None
@features
class User:
id: int
home: LatLon
work: LatLon
commute_distance: float = F.haversine(
lat1=_.home.lat,
lon1=_.home.lon,
lat2=_.work.lat,
lon2=_.work.lon,
)
You can create custom functions to encapsulate complex logic or reusable computations in your expressions. For example, if you wanted to apply consistent windows across many features, you could define a custom function like this:
Use helper functions to create expressions
from chalk import _
from chalk.features import features, DataFrame
def count_where(*filters):
return _.transactions[_.created_at > _.chalk_window, *filters].count()
@features
class User:
id: int
transactions: DataFrame[Transaction]
num_large_transactions: int = count_where(_.total > 1000)
num_small_transactions: int = count_where(_.total < 100)
Don't use expressions in Python resolvers
@features
class User:
id: int
name: str
email: str
name_email_match_score: float
@online
def get_score(
name: User.name,
email: User.email,
) -> User.name_email_match_score:
# DONT DO THIS!!
return F.jaccard_similarity(name, email)
Expressions are not supported in Python resolvers, so you cannot use Chalk functions like F.jaccard_similarity
in a Python resolver. Instead, use expressions to define the feature directly in the feature class.
@features
class User:
id: int
name: str
email: str
name_email_match_score: float = F.jaccard_similarity(
_.name, _.email
)
DataFrame features can be filtered with expressions.
Extending our Transaction
example, we can create a User
feature class with a has-many relationship
to Transaction
. Then, we can define a feature representing the number of large purchases by referencing the existing
User.transactions
feature:
from chalk.features import _, features, DataFrame
@features
class Transaction:
id: int
user_id: "User.id"
total: float
sales_tax: float
subtotal: float = _.total - _.sales_tax
@features
class User:
id: int
# implicit has-many relationship with Transaction due to `user_id` above
transactions: DataFrame[Transaction]
num_large_transactions: int = _.transactions[_.total > 1000].count()
The object referenced by _
changes depending on its current scope. In this code, the _
in _.transactions
references the User
object.
Within the DataFrame
filter, the _
in _.total
references each Transaction
object as each one is evaluated.
The count
aggregation is covered in the next section.
DataFrame features support projection with expressions, which produce a new DataFrame scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.
With our Transaction
example, we already saw a count
aggregation for counting the number of large transactions. We
can add another aggregation for computing the user’s total spend:
from chalk.features import _, features, DataFrame
@features
class Transaction:
id: int
user_id: "User.id"
sales_tax: float
subtotal: float
total: float = _.subtotal + _.sales_tax
@features
class User:
id: int
transactions: DataFrame[Transaction]
num_large_transactions: int = _.transactions[_.total > 1000].count()
total_spend: float = _.transactions[_.total].sum()
To compute User.total_spend
, we needed to create a projection of the User.transactions
DataFrame
limited to only the Transaction.total
column so that the sum
aggregation could work.
In contrast, no projection was needed for the num_large_transactions
aggregation because count
works on DataFrames
with any number of columns.
Use materialized aggregations
For computing low-latency aggregations over high volumes of data, Chalk also offers materialized windowed aggregations that uses materialization of buckets of data to compute large aggregations efficiently.
Aggregation functions have varying behavior when handling None
values and empty DataFrames
.
If an aggregation function says None
values are skipped in the table below,
it will consider a DataFrame
with only None
values as empty.
Function | None values | Empty DataFrame | Notes |
---|---|---|---|
sum | Skipped | Returns 0 | |
min | Skipped | Returns None | |
max | Skipped | Returns None | |
mean | Skipped | Returns None | Feature type must be float or float \| None . None values are skipped, meaning they are not included in the mean calculation. |
count | Included | Returns 0 | |
any | Skipped | Returns False | |
all | Skipped | Returns True | |
std | Skipped | See notes | Standard deviation. Requires at least 2 values. For DataFrames with less than 2 values, returns None . Aliases: stddev , stddev_sample , std_sample . |
var | Skipped | See notes | Variance. Same requirements as std . Alias: var_sample . |
approx_count_distinct | Skipped | Returns 0 | |
approx_percentile | Skipped | Returns None | Takes one argument, quantile , expected to be a float in the range [0, 1] . Example: approx_percentile(0.75) returns a value approximately equal to the 75th percentile of the not-None values in the DataFrame. |
For aggregations that can return None
, either mark the feature as optional
(for example, by setting the feature type to float | None
) or use coalesce
to fall back to a default value.
To specify run conditions such as environment, tags, and versions for a feature
that is resolved through an expression,
you can use the feature
function and pass in the expression as an argument.
from chalk.features import Primary, features, feature
@features
class User:
id: int
purchases: DataFrame[Purchase]
# Uses a default value of 0 when one cannot be computed.
num_purchases: int = feature(
expression=_.purchases.count(),
default=0,
environment=["staging", "dev"],
tags=["fraud", "credit"],
version=1,
)
To test your expressions, we recommend setting up integration tests or iterating on a branch.