Expressions - Chalk

Overview

Chalk Expressions let you define features declaratively, using symbolic computation over your data. While you write expressions in idiomatic Python, they are compiled and executed as vectorized C++, enabling low-latency computation at serve time and high-throughput processing at train time.

Expressions support a wide range of operations, including arithmetic, filtering, aggregations, and built-in functions.

For example, in a Transaction feature class, we can compute the subtotal of a transaction as the difference between its total and sales tax:

from chalk.features import features
from chalk import _

@features
class Transaction:
    id: int
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax

The _ symbol refers to the current scope (here, the feature class Transaction) and is used to reference other fields on the same instance. Expressions like _.total - _.sales_tax are compiled into native execution plans that run efficiently in production.

In addition to referencing fields on the same object, you can traverse relationships. If each Transaction is associated with a User, for example, you can compute a string similarity between the user’s name and the transaction memo:

from chalk.features import _, features
import chalk.functions as F

@features
class Transaction:
    ...
    amount: float
    memo: str
    user: "User"
    name_match_score: float = F.jaccard_similarity(
      _.user.name, _.memo
    )

Here, _.user.name follows the foreign key relationship from Transaction to User. The function F.jaccard_similarity is one of many built-in Chalk functions that operate on symbolic expressions.

Expressions can also perform aggregations over related records. In a User feature class, we can compute aggregates like the number of large transactions or the total amount spent:

from chalk import _
from chalk.features import DataFrame, features

@features
class User:
    id: int
    name: str
    transactions: DataFrame[Transaction]

    # Total spend is nullable because the sum of an empty DataFrame is null
    total_spend: float | None = _.transactions[_.total].sum()

    # The count is never null, because an empty DataFrame has count 0
    num_large_txns: int = _.transactions[_.total > 1000].count()

In this context, _ refers to the User instance when referring to _.transactions. But when you apply a filter, like _.transactions[_.total > 1000], the expression inside the brackets is evaluated in the context of each individual Transaction. That means _.total refers to the total field on each Transaction, not on the User. This scoped evaluation makes it easy to filter, project, and aggregate over related data.
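As a rough mental model of this scoping (not a description of how Chalk executes it), the filter and the projection correspond to the following plain-Python computations over one user's transactions. The sample rows are hypothetical.

```python
# One user's related transactions, as plain dictionaries.
transactions = [
    {"id": 1, "total": 1500.0},
    {"id": 2, "total": 40.0},
    {"id": 3, "total": 2200.0},
]

# _.transactions[_.total > 1000].count():
# the condition inside the brackets is checked once per transaction.
num_large_txns = sum(1 for txn in transactions if txn["total"] > 1000)

# _.transactions[_.total].sum():
# project a single column, then aggregate it.
total_spend = sum(txn["total"] for txn in transactions)

print(num_large_txns, total_spend)  # 2 3740.0
```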

All expressions are statically analyzed, optimized to eliminate redundant computation, and executed as high-performance C++ at runtime.

Scalar Functions

Chalk expressions support a wide range of built-in functions for manipulating data, performing calculations, and transforming features. These functions can be used in expressions to operate on feature values, DataFrames, and other data types.

from chalk import online
from chalk.features import _, features

@features
class Transaction:
    id: int
    total: float
    sales_tax: float
    # Expression form of the computation
    subtotal: float = _.total - _.sales_tax

# Python resolver computing the same value, shown for comparison
@online
def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
    return total - sales_tax

Infix Operators

Chalk expressions support a variety of infix operators for arithmetic, conditions, and boolean logic.

Operator	Description	Example
`+`	Addition	`_.total + _.sales_tax`
`-`	Subtraction	`_.total - _.sales_tax`
`*`	Multiplication	`_.quantity * _.price`
`/`	Division	`_.total / _.quantity`
`>`	Greater than	`_.total > 1000`
`>=`	Greater than or equal	`_.total >= 1000`
`<`	Less than	`_.total < 1000`
`<=`	Less than or equal	`_.total <= 1000`
`==`	Equal	`_.status == "completed"`
`!=`	Not equal	`_.status != "completed"`
`&`	Boolean and	`_.is_active & _.is_verified`
`|`	Boolean or	`_.is_active | _.is_verified`
`~`	Boolean not	`~_.is_active`

Do not use Python's and, or, not, or is operators in expressions.

Python does not allow these operators to be overridden, so they will not work with Chalk's expressions. Instead, use the infix operators &, |, and ~ for boolean logic, and use == and != for equality comparisons.
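The difference is easy to see with a small stand-in class. The sketch below uses a hypothetical `Flag` type (not part of Chalk) that records boolean operations as text: `&`, `|`, and `~` dispatch to methods the class controls, while `and` only tests truthiness and returns one of its operands unchanged.

```python
class Flag:
    """Hypothetical symbolic boolean that records operations as text."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Flag(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Flag(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Flag(f"(NOT {self.text})")


a, b = Flag("is_active"), Flag("is_verified")

combined = a & ~b      # dispatches to __and__ and __invert__
print(combined.text)   # (is_active AND (NOT is_verified))

shortcut = a and b     # `and` checks truthiness, then returns an operand
print(shortcut is b)   # True: the "and" was never recorded
```

One related Python detail: `&` and `|` bind more tightly than comparison operators, so comparisons combined with them need parentheses, as in `(_.total > 1000) & (_.status == "completed")`.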

Builtin Functions

The chalk.functions module exposes several helpful functions that can be used in combination with expression references to transform features. These functions are meant to be used in expressions and are not available as standalone functions. To view all available functions, see our SDK docs.

Structs

Expressions can be used to access nested attributes from other features in a feature class, whether these other features are struct dataclasses, features, or DataFrames.

import chalk.functions as F
from chalk.features import _, features
from dataclasses import dataclass

@dataclass
class LatLon:
    lat: float | None
    lon: float | None

@features
class User:
    id: int
    home: LatLon
    work: LatLon
    commute_distance: float = F.haversine(
        lat1=_.home.lat,
        lon1=_.home.lon,
        lat2=_.work.lat,
        lon2=_.work.lon,
    )
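For reference, the haversine formula that a function of this name conventionally computes can be sketched in pure Python. This is an illustration only: the `haversine_km` helper is hypothetical, and the Earth radius and kilometer units are assumptions, since the units returned by `F.haversine` are not stated here (check the SDK docs).

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, result in kilometers


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))  # 111.2
```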

Custom Functions

You can create custom functions to encapsulate complex logic or reusable computations in your expressions. For example, if you wanted to apply consistent windows across many features, you could define a custom function like this:

Use helper functions to create expressions

from chalk import _
from chalk.features import features, DataFrame

def count_where(*filters):
    # Unpacking *filters inside [] requires Python 3.11 or newer.
    return _.transactions[_.created_at > _.chalk_window, *filters].count()

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    num_large_transactions: int = count_where(_.total > 1000)
    num_small_transactions: int = count_where(_.total < 100)

Expressions are not supported in Python resolvers, so you cannot use Chalk functions like F.jaccard_similarity in a Python resolver.

Don't use expressions in Python resolvers

import chalk.functions as F
from chalk import online
from chalk.features import features

@features
class User:
    id: int
    name: str
    email: str
    name_email_match_score: float

@online
def get_score(
    name: User.name,
    email: User.email,
) -> User.name_email_match_score:
    # Don't do this!! Expressions don't run in Python resolvers
    return F.jaccard_similarity(name, email)

Instead, use expressions to define the feature directly in the feature class:

Use expressions in feature classes

import chalk.functions as F
from chalk.features import _, features

@features
class User:
    id: int
    name: str
    email: str
    name_email_match_score: float = F.jaccard_similarity(
        _.name, _.email
    )

DataFrame Functions

Conditions and filters

DataFrame features can be filtered with expressions.

Extending our Transaction example, we can create a User feature class with a has-many relationship to Transaction. Then, we can define a feature representing the number of large purchases by referencing the existing User.transactions feature:

from chalk.features import _, features, DataFrame

@features
class Transaction:
    id: int
    user_id: "User.id"
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax

@features
class User:
    id: int
    # implicit has-many relationship with Transaction due to `user_id` above
    transactions: DataFrame[Transaction]
    num_large_transactions: int = _.transactions[_.total > 1000].count()

The object referenced by _ changes depending on its current scope. In this code, the _ in _.transactions references the User object. Within the DataFrame filter, the _ in _.total references each Transaction object as each one is evaluated. The count aggregation is covered in the next section.

Projections and aggregations

DataFrame features support projection with expressions, which produce a new DataFrame scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.

With our Transaction example, we already saw a count aggregation for counting the number of large transactions. We can add another aggregation for computing the user’s total spend:

from chalk.features import _, features, DataFrame

@features
class Transaction:
    id: int
    user_id: "User.id"
    sales_tax: float
    subtotal: float
    total: float = _.subtotal + _.sales_tax

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    num_large_transactions: int = _.transactions[_.total > 1000].count()
    total_spend: float = _.transactions[_.total].sum()

To compute User.total_spend, we needed to create a projection of the User.transactions DataFrame limited to only the Transaction.total column so that the sum aggregation could work. In contrast, no projection was needed for the num_large_transactions aggregation because count works on DataFrames with any number of columns.

Use materialized aggregations

For computing low-latency aggregations over high volumes of data, Chalk also offers [materialized windowed aggregations](/docs/materialized_aggregations), which pre-aggregate data into buckets so that large aggregations can be computed efficiently.
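The general idea behind bucketed aggregation can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not Chalk's implementation: the one-hour bucket size, the `materialize` and `windowed_sum` helpers, and the sample events are all assumptions.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # assumption: one-hour buckets


def materialize(events):
    """Pre-aggregate (timestamp, amount) events into per-bucket sums."""
    buckets = defaultdict(float)
    for ts, amount in events:
        buckets[ts // BUCKET_SECONDS] += amount
    return dict(buckets)


def windowed_sum(buckets, now, window_seconds):
    """Sum every bucket that overlaps the trailing window.

    The result is only as precise as the bucket size: the whole of the
    oldest overlapping bucket is included.
    """
    first = (now - window_seconds) // BUCKET_SECONDS
    last = now // BUCKET_SECONDS
    return sum(total for b, total in buckets.items() if first <= b <= last)


events = [(100, 5.0), (3700, 7.0), (7300, 11.0), (7400, 13.0)]
buckets = materialize(events)

# A two-hour window ending at t=10800 reads three buckets, not every event.
print(windowed_sum(buckets, now=10800, window_seconds=7200))  # 31.0
```

The query-time cost depends on the number of buckets in the window rather than the number of raw events, which is what makes large aggregations cheap to serve.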

Aggregation functions

Aggregation functions have varying behavior when handling None values and empty DataFrames. If an aggregation function says None values are skipped in the table below, it will consider a DataFrame with only None values as empty.

Function	`None` values	Empty DataFrame	Notes
`sum`	Skipped	Returns `0`
`min`	Skipped	Returns `None`
`max`	Skipped	Returns `None`
`mean`	Skipped	Returns `None`	Feature type must be `float` or `float | None`. `None` values are skipped, meaning they are not included in the mean calculation.
`count`	Included	Returns `0`
`any`	Skipped	Returns `False`
`all`	Skipped	Returns `True`
`std`	Skipped	See notes	Standard deviation. Requires at least 2 values. For DataFrames with fewer than 2 values, returns `None`. Aliases: `stddev`, `stddev_sample`, `std_sample`.
`var`	Skipped	See notes	Variance. Same requirements as `std`. Alias: `var_sample`.
`approx_count_distinct`	Skipped	Returns `0`
`approx_percentile`	Skipped	Returns `None`	Takes one argument, `quantile`, expected to be a float in the range `[0, 1]`. Example: `approx_percentile(0.75)` returns a value approximately equal to the 75th percentile of the not-`None` values in the DataFrame.
`approx_top_k`	Skipped	Returns `None`	Takes one argument, `k`, expected to be a positive integer, as a keyword argument. Example: `approx_top_k(k=25)` returns the 25 or fewer approximately-most common values.

For aggregations that can return None, either mark the feature as optional (for example, by setting the feature type to float | None) or use coalesce to fall back to a default value.
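The rules in the table can be modeled in plain Python to make the edge cases explicit. The sketch below follows the table as written for `sum`, `mean`, `count`, and `std`; the `agg_*` helpers and this `coalesce` are hypothetical stand-ins, not Chalk functions.

```python
import statistics


def agg_sum(values):
    present = [v for v in values if v is not None]
    return sum(present)  # 0 when nothing is present


def agg_mean(values):
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def agg_count(values):
    return len(values)  # None values are included in the count


def agg_std(values):
    present = [v for v in values if v is not None]
    # Sample standard deviation needs at least two values.
    return statistics.stdev(present) if len(present) >= 2 else None


def coalesce(*candidates):
    """Return the first candidate that is not None."""
    return next((c for c in candidates if c is not None), None)


only_nones = [None, None]
print(agg_sum(only_nones), agg_mean(only_nones), agg_count(only_nones))  # 0 None 2
print(agg_std([3.0]))               # None
print(coalesce(agg_mean([]), 0.0))  # 0.0
```

Note how a column containing only `None` values behaves like an empty one for `sum` and `mean`, while `count` still reports two rows.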

Run conditions

To specify run conditions such as environment, tags, and versions for a feature that is resolved through an expression, you can use the feature function and pass in the expression as an argument.

from chalk.features import DataFrame, _, feature, features

@features
class User:
    id: int

    purchases: DataFrame[Purchase]
    # Uses a default value of 0 when one cannot be computed.
    num_purchases: int = feature(
        expression=_.purchases.count(),
        default=0,
        environment=["staging", "dev"],
        tags=["fraud", "credit"],
        version=1,
    )

Testing

To test your expressions, we recommend setting up integration tests or iterating on a branch.

Dynamic Expressions

In some cases, you may want to build expressions dynamically. For example, if you have a rules engine that generates expressions and stores them in a database, you can load those expressions at runtime and ask Chalk to compute their values.

Consider this User and Transaction model, which would be checked in to the code:

from chalk.features import features, DataFrame

@features
class Transaction:
    id: int
    user_id: "User.id"
    user: "User"
    total: float
    sales_tax: float

@features
class User:
    id: int
    transactions: DataFrame[Transaction]
    name: str
    email: str

Using the Golang or Python SDKs, you can compute both scalar and aggregate features on demand.

For example, in Golang, if we wanted to compare the lowercased name and email by their Jaccard similarity, we could build the expression dynamically and ask Chalk to compute it:

package main

import (
	"testing"
	"github.com/zeebo/assert"
	"github.com/chalk-ai/chalk-go/expr"
	"github.com/chalk-ai/chalk-go"
)

func TestQueryingExpressions(t *testing.T) {
	// Picks up the ambient credentials from the `chalk login` run on the CLI.
	client, err := chalk.NewGRPCClient(t.Context())
	assert.NoError(t, err)
	result, err := client.OnlineQueryBulk(
		t.Context(),
		chalk.OnlineQueryParams{}.
			WithInput("user.id", []int{1}).
			WithOutputs("user.name").
			WithOutputs("user.email").
			WithOutputExprs(
				expr.FunctionCall(
					"jaccard_similarity",
					expr.FunctionCall("lower", expr.Col("name")),
					expr.FunctionCall("lower", expr.Col("email")),
				).
					As("user.name_email_sim"),
			),
	)
	assert.NoError(t, err)
	row, err := result.GetRow(0)
	assert.NoError(t, err)
	for feature, value := range row.Features {
		t.Logf("Feature: %s, Value: %+v", feature, value.Value)
	}
}

This program produces output like this:

=== RUN   TestQueryingExpressions
    chalk_test.go:46: Feature: user.name, Value: Nicole Mann
    chalk_test.go:46: Feature: user.email, Value: nicoleam@nasa.gov
    chalk_test.go:46: Feature: user.name_email_sim, Value: 0.5714285714285714
--- PASS: TestQueryingExpressions (0.43s)

The feature user.name_email_sim was computed using the expression jaccard_similarity(lower(name), lower(email)), which was built dynamically using the expr package.
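The reported score is consistent with a Jaccard similarity taken over the sets of characters in each lowercased string. The sketch below reproduces the value under that assumption; `char_jaccard` is a hypothetical helper, not Chalk's implementation, and the handling of two empty strings is a guess.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty strings score 0
    return len(set_a & set_b) / len(union)


score = char_jaccard("Nicole Mann".lower(), "nicoleam@nasa.gov".lower())
# 8 shared characters out of 14 distinct ones, about 0.5714
print(score)
```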

For a complete list of functions, see our SDK docs. All the functions available in expressions are also available in the expr package.

You can also compute aggregations dynamically. Using the Golang SDK, we can count a user's transactions that have a positive amount:

result, err := client.OnlineQueryBulk(
    t.Context(),
    chalk.OnlineQueryParams{}.
        WithInput("user.id", []int{1}).
        WithOutputExprs(
            expr.DataFrame("transactions").
                Filter(expr.Col("amount").Gt(expr.Float(0.))).
                Agg("count").
                As("user.positive_transaction_count"),
        ),
)

For this program, the output would look like this:

=== RUN   TestChalkClient
    chalk_test.go:42: Feature: user.positive_transaction_count, Value: 33
--- PASS: TestChalkClient (0.45s)

In a Python expression, the above SDK call is equivalent to:

User.positive_transaction_count = _.transactions[_.amount > 0].count()

Both scalar functions and aggregations can also be computed with the Python SDK. See this guide for more details.
