Expressions

Overview

Chalk expressions are used to derive features from operations on other features. In feature definitions, _ represents a reference to the containing feature class and is used to access other features in the same instance. In addition to arithmetic operations, expressions can also be used to filter and aggregate data.

Expressions are useful because Chalk statically analyzes and optimizes them, leading to better performance when compared to equivalent Python resolvers. They are also useful for succinctly defining common features, such as the number of times a user failed a login attempt over the past 30 days or the total amount spent at a given merchant.

Nested References

Expressions can be used to access nested attributes from other features in a feature class, whether these other features are struct dataclasses, features, DataFrames of features, or JSON’s.

from dataclasses import dataclass

@dataclass
class LatLonLocation:
    lat: float
    lon: float

@dataclasses.dataclass
class NestedLatLon:
    lat: float | None
    lon: int | None
    foo: List[str] | None


@dataclasses.dataclass
class LatLon:
    bar: str | None
    lat: float | None
    lon: int | None
    foo: List[str] | None
    nested_latlon: NestedLatLon | None


@features
class StructAttributeAccess:
    id: int
    latlng: LatLon
    bar: str | None = _.latlng.bar
    lat: float | None = _.latlng.lat
    lon: float | None = _.latlng.lon
    foo: List[str] | None = _.latlng.foo
    nested_lat: float | None = _.latlng.nested_latlon.lat
    nested_lon: int | None = _.latlng.nested_latlon.lon
    nested_foo: List[str] | None = _.latlng.nested_latlon.foo

Arithmetic

In this example, we have a Transaction feature class with total and sales_tax features and we want to define subtotal as total minus sales_tax. Instead of writing a Python resolver, we can resolve subtotal with an expression:

from chalk import online
from chalk.features import _, features

@features
class Transaction:
  id: int
  total: float
  sales_tax: float
  subtotal: float = _.total - _.sales_tax

@online
def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
    return total - sales_tax

Conditions and filters

DataFrame features can be filtered with expressions.

Extending our Transaction example, we can create a User feature class with a has-many relationship to Transaction. Then, we can define a feature representing the number of large purchases by referencing the existing User.transactions feature:

from chalk.features import _, features, DataFrame

@features
class Transaction:
  id: int
  user_id: "User.id"
  total: float
  sales_tax: float
  subtotal: float = _.total - _.sales_tax

@features
class User:
   id: int
   # implicit has-many relationship with Transaction due to `user_id` above
   transactions: DataFrame[Transaction]
   num_large_transactions: int = _.transactions[_.total > 1000].count()

The object referenced by _ changes depending on its current scope. In this code, the _ in _.transactions references the User object. Within the DataFrame filter, the _ in _.total references each Transaction object as each one is evaluated. The count aggregation is covered in the next section.

Projections and aggregations

DataFrame features support projection with expressions, which produce a new DataFrame scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.

With our Transaction example, we already saw a count aggregation for counting the number of large transactions. We can add another aggregation for computing the user’s total spend:

from chalk.features import _, features, DataFrame

@features
class Transaction:
  id: int
  user_id: "User.id"
  total: float
  sales_tax: float
  subtotal: float = _.total - _.sales_tax

@features
class User:
  id: int
  # implicit has-many relationship with Transaction due to `user_id` above
  transactions: DataFrame[Transaction]
  num_large_transactions: int = _.transactions[_.total > 1000].count()
  total_spend: float = _.transactions[_.total].sum()

To compute User.total_spend, we needed to create a projection of the User.transactions DataFrame limited to only the Transaction.total column so that the sum aggregation could work. In contrast, no projection was needed for num_large_transaction’s count aggregation because count works on DataFrames with any number of columns.

All supported operations

Arithmetic

Addition: +
Subtraction: -
Multiplication: *
Division: /

Conditions

Greater than: >
Greater than or equal: >=
Less than: <
Less than or equal: <=
Equal: ==
Not equal: !=
Boolean and: &
Boolean or: |
Boolean not: ~

Don't use `or`, `and`, `not`, and `is` in expressions

Do not use Python's `and`, `or`, `not`, or `is` operators in expressions. Python does not allow these operators to be overridden, so they will not work with Chalk's expressions.

Aggregations

Aggregation functions have varying behavior when handling None values and empty DataFrames. If an aggregation function says None values are skipped in the table below, it will consider a DataFrame with only None values as empty.

Function	`None` values	Empty DataFrame	Notes
`sum`	Skipped	Returns `0`
`min`	Skipped	Returns `None`
`max`	Skipped	Returns `None`
`mean`	Skipped	Returns `None`	Feature type must be `float` or `float \| None`. `None` values are skipped, meaning they are not included in the mean calculation.
`count`	Included	Returns `0`
`any`	Skipped	Returns `False`
`all`	Skipped	Returns `True`
`std`	Skipped	See notes	Standard deviation. Requires at least 2 values. For DataFrames with less than 2 values, returns `None`. Aliases: `stddev`, `stddev_sample`, `std_sample`.
`var`	Skipped	See notes	Variance. Same requirements as `std`. Alias: `var_sample`.
`approx_count_distinct`	Skipped	Returns `0`
`approx_percentile`	Skipped	Returns `None`	Takes one argument, `quantile`, expected to be a float in the range `[0, 1]`. Example: `approx_percentile(0.75)` returns a value approximately equal to the 75th percentile of the not-`None` values in the DataFrame.

For aggregations that can return None, either mark the feature as optional (for example, by setting the feature type to float | None) or use coalesce to fall back to a default value.

Additional functions

The chalk.functions module exposes several helpful functions that can be used in combination with expression references to transform features:

Overview

Nested References

Arithmetic

Conditions and filters

Projections and aggregations

All supported operations

Arithmetic

Conditions

Aggregations

Additional functions

Logic

String Manipulation

Encoding and decoding

Math

Datetime Manipulation

URL Manipulation

Array Manipulation

Other

On this page

​Overview

​Nested References

​Arithmetic

​Conditions and filters

​Projections and aggregations

​All supported operations

​Arithmetic

​Conditions

​Aggregations

​Additional functions

​Logic

​String Manipulation

​Encoding and decoding

​Math

​Datetime Manipulation

​URL Manipulation

​Array Manipulation

​Other

On this page

Overview

Nested References

Arithmetic

Conditions and filters

Projections and aggregations

All supported operations

Arithmetic

Conditions

Aggregations

Additional functions

Logic

String Manipulation

Encoding and decoding

Math

Datetime Manipulation

URL Manipulation

Array Manipulation

Other