Features
Using underscore expressions to define features
Underscore expressions are used to derive features from operations on other features. In feature definitions, _
represents a reference to the containing feature class and is used to access other features in the same instance. In
addition to arithmetic operations, underscore expressions can also be used to filter and aggregate data.
Underscore expressions are useful because Chalk statically analyzes and optimizes them, leading to better performance when compared to equivalent Python resolvers. They are also useful for succinctly defining common features, such as the number of times a user failed a login attempt over the past 30 days or the total amount spent at a given merchant.
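To make the static-analysis point concrete, here is a minimal sketch of how a placeholder object can record operations instead of running them. It is plain Python written for illustration, not Chalk's implementation; the names `Expr`, `Ref`, `BinOp`, `Placeholder`, and `resolve` are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub}


class Expr:
    """Operators return a new tree node instead of computing a value."""

    def __add__(self, other):
        return BinOp("+", self, other)

    def __sub__(self, other):
        return BinOp("-", self, other)


class Ref(Expr):
    """A reference to a named field, such as `_.total`."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, row):
        return row[self.name]

    def __repr__(self):
        return f"_.{self.name}"


class BinOp(Expr):
    """A recorded binary operation between two operands."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, row):
        return OPS[self.op](resolve(self.left, row), resolve(self.right, row))

    def __repr__(self):
        return f"({self.left!r} {self.op} {self.right!r})"


def resolve(node, row):
    # Plain constants pass through; expression nodes are evaluated.
    return node.evaluate(row) if isinstance(node, Expr) else node


class Placeholder:
    """Hypothetical stand-in for `_`: attribute access yields a Ref."""

    def __getattr__(self, name):
        return Ref(name)


_ = Placeholder()
subtotal = _.total - _.sales_tax  # builds a tree; nothing is computed yet

print(subtotal)                                              # (_.total - _.sales_tax)
print(subtotal.evaluate({"total": 10.0, "sales_tax": 1.5}))  # 8.5
```

Because the expression exists as an inspectable tree before any data is seen, a system can analyze or rewrite it ahead of time, which is not possible with the opaque body of an ordinary Python function.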
Underscore expressions can be used to access nested attributes of other features in a feature class, whether those features are struct dataclasses, features, DataFrames of features, or JSON values.
from dataclasses import dataclass
from typing import List

from chalk.features import _, features

@dataclass
class LatLonLocation:
    lat: float
    lon: float

@dataclass
class NestedLatLon:
    lat: float | None
    lon: int | None
    foo: List[str] | None

@dataclass
class LatLon:
    bar: str | None
    lat: float | None
    lon: int | None
    foo: List[str] | None
    nested_latlon: NestedLatLon | None

@features
class StructAttributeAccess:
    id: int
    latlng: LatLon
    # Each feature below reads one attribute of the `latlng` struct.
    bar: str | None = _.latlng.bar
    lat: float | None = _.latlng.lat
    lon: float | None = _.latlng.lon
    foo: List[str] | None = _.latlng.foo
    # Attribute access can be chained into nested structs.
    nested_lat: float | None = _.latlng.nested_latlon.lat
    nested_lon: int | None = _.latlng.nested_latlon.lon
    nested_foo: List[str] | None = _.latlng.nested_latlon.foo
In this example, we have a Transaction feature class with total and sales_tax features, and we want to define subtotal as total minus sales_tax. Instead of writing a Python resolver, we can resolve subtotal with an underscore expression:
from chalk import online
from chalk.features import _, features

@features
class Transaction:
    id: int
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax

# The Python resolver that the underscore expression above stands in for,
# shown for comparison:
@online
def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
    return total - sales_tax
DataFrame features can be filtered with underscore expressions.
Extending our Transaction example, we can create a User feature class with a has-many relationship to Transaction. Then, we can define a feature representing the number of large purchases by referencing the existing User.transactions feature:
from chalk.features import _, features, DataFrame

@features
class Transaction:
    id: int
    user_id: "User.id"
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax

@features
class User:
    id: int
    # implicit has-many relationship with Transaction due to `user_id` above
    transactions: DataFrame[Transaction]
    num_large_transactions: int = _.transactions[_.total > 1000].count()
The object referenced by _ changes depending on its current scope. In this code, the _ in _.transactions references the User object. Within the DataFrame filter, the _ in _.total references each Transaction object as each one is evaluated. The count aggregation is covered in the next section.
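For intuition about the two scopes, the filter-and-count expression corresponds roughly to the plain Python below, where the outer object is a user and the loop variable is each transaction. This is an illustrative sketch with stand-in dataclasses, not how Chalk evaluates the expression; the function `num_large_transactions` and the sample data are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transaction:
    id: int
    total: float


@dataclass
class User:
    id: int
    transactions: List[Transaction] = field(default_factory=list)


def num_large_transactions(user: User) -> int:
    # Outer scope: `user` plays the role of `_` in `_.transactions`.
    # Inner scope: `txn` plays the role of `_` in `_.total > 1000`.
    return sum(1 for txn in user.transactions if txn.total > 1000)


user = User(
    id=1,
    transactions=[
        Transaction(id=1, total=250.0),
        Transaction(id=2, total=1500.0),
        Transaction(id=3, total=1000.0),  # not strictly greater than 1000
    ],
)

print(num_large_transactions(user))  # 1
```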
DataFrame features support projection with underscore expressions, which produce a new DataFrame scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.
With our Transaction example, we already saw a count aggregation for counting the number of large transactions. We can add another aggregation for computing the user’s total spend:
from chalk.features import _, features, DataFrame

@features
class Transaction:
    id: int
    user_id: "User.id"
    total: float
    sales_tax: float
    subtotal: float = _.total - _.sales_tax

@features
class User:
    id: int
    # implicit has-many relationship with Transaction due to `user_id` above
    transactions: DataFrame[Transaction]
    num_large_transactions: int = _.transactions[_.total > 1000].count()
    total_spend: float = _.transactions[_.total].sum()
To compute User.total_spend, we needed to create a projection of the User.transactions DataFrame limited to only the Transaction.total column so that the sum aggregation could work. In contrast, no projection was needed for the count aggregation in num_large_transactions because count works on DataFrames with any number of columns.
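The difference between projecting before sum and counting whole rows can be sketched in plain Python, treating a DataFrame as a list of row dictionaries. This is only an analogy under that assumption; the helper `project` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List

# Stand-in for a User's `transactions` DataFrame: one dict per row.
rows: List[Dict[str, Any]] = [
    {"id": 1, "total": 20.0, "sales_tax": 1.5},
    {"id": 2, "total": 35.5, "sales_tax": 2.5},
    {"id": 3, "total": 4.5, "sales_tax": 0.5},
]


def project(frame: List[Dict[str, Any]], column: str) -> List[Any]:
    """Keep a single column, analogous to `_.transactions[_.total]`."""
    return [row[column] for row in frame]


# sum needs exactly one column of values to add up, so project first.
total_spend = sum(project(rows, "total"))

# count only needs the rows themselves, whatever columns they carry.
num_transactions = len(rows)

print(total_spend)       # 60.0
print(num_transactions)  # 3
```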
Underscore expressions support the following operators:
+
-
*
/
>
>=
<
<=
==
!=
&
|
Do not use Python’s and, or, not, or is operators in underscore expressions. Python does not allow these operators to be overridden, so they will not work with Chalk’s underscore expressions.
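The Python behavior behind this rule can be seen with a small stand-in class. The class `Cond` below is hypothetical and unrelated to Chalk's internals; it only demonstrates that `&` and `|` dispatch to methods a library can define, while `and` and `or` are handled by the interpreter and silently drop one operand.

```python
class Cond:
    """Hypothetical condition object that records how it was combined."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Cond(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Cond(f"({self.text} OR {other.text})")

    def __repr__(self):
        return self.text


a = Cond("total > 1000")
b = Cond("sales_tax < 50")

combined = a & b         # calls Cond.__and__, so both sides are recorded
short_circuit = a and b  # Python checks bool(a), then returns b unchanged
first_truthy = a or b    # Python checks bool(a), then returns a unchanged

print(combined)       # (total > 1000 AND sales_tax < 50)
print(short_circuit)  # sales_tax < 50
print(first_truthy)   # total > 1000
```

One related Python detail: `&` and `|` bind more tightly than comparison operators, so each comparison should be wrapped in parentheses when combining conditions, as in `(x > 1) & (y < 2)`.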
Aggregation functions have varying behavior when handling None values and empty DataFrames. If an aggregation function says None values are skipped in the table below, it will consider a DataFrame with only None values as empty.
Function | None values | Empty DataFrame | Notes |
---|---|---|---|
sum | Skipped | Returns 0 | |
min | Skipped | Returns None | |
max | Skipped | Returns None | |
mean | Skipped | Returns None | Feature type must be float or Optional[float]. None values are skipped, meaning they are not included in the mean calculation. |
count | Included | Returns 0 | |
any | Skipped | Returns False | |
all | Skipped | Returns True | |
std | Skipped | See notes | Standard deviation. Requires at least 2 values. For DataFrames with fewer than 2 values, returns None. Aliases: stddev, stddev_sample, std_sample. |
var | Skipped | See notes | Variance. Same requirements as std. Alias: var_sample. |
approx_count_distinct | Skipped | Returns 0 | |
approx_percentile | Skipped | Returns None | Takes one argument, quantile, expected to be a float in the range [0, 1]. Example: approx_percentile(0.75) returns a value approximately equal to the 75th percentile of the non-None values in the DataFrame. |
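The None and empty-input rules in the table can be mirrored in plain Python for a few of the functions. This is a sketch of the documented semantics only, not Chalk's implementation; the `agg_*` helper names are hypothetical, and using the sample standard deviation for std is an assumption based on the stddev_sample alias.

```python
import statistics
from typing import List, Optional


def non_null(values: List[Optional[float]]) -> List[float]:
    """Drop None values, as the 'Skipped' rows in the table describe."""
    return [v for v in values if v is not None]


def agg_sum(values):
    return sum(non_null(values))  # empty input gives 0


def agg_mean(values):
    kept = non_null(values)
    return sum(kept) / len(kept) if kept else None


def agg_count(values):
    return len(values)  # None values are included


def agg_std(values):
    kept = non_null(values)
    # Assumption: sample standard deviation, which needs at least 2 values.
    return statistics.stdev(kept) if len(kept) >= 2 else None


def agg_any(values):
    return any(non_null(values))  # empty input gives False


def agg_all(values):
    return all(non_null(values))  # empty input gives True


print(agg_sum([None, None]))       # 0
print(agg_mean([1.0, None, 3.0]))  # 2.0
print(agg_mean([None]))            # None
print(agg_count([1.0, None]))      # 2
print(agg_std([5.0]))              # None
```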
For aggregations that can return None, either mark the feature as optional (for example, by setting the feature type to float | None) or use coalesce to fall back to a default value.
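The fallback idea is simple to sketch in plain Python. The `coalesce` function below is a local stand-in that returns its first non-None argument; it illustrates the general meaning of coalesce and is not Chalk's function or its signature. The helper `max_or_default` is likewise hypothetical.

```python
from typing import List, Optional


def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


def max_or_default(values: List[Optional[float]], default: float) -> float:
    # max over no usable values yields None, so fall back to the default.
    kept = [v for v in values if v is not None]
    largest = max(kept) if kept else None
    return coalesce(largest, default)


print(max_or_default([3.0, None, 7.5], default=0.0))  # 7.5
print(max_or_default([], default=0.0))                # 0.0
```

With a fallback in place, the resulting value is never None, so the feature can keep a non-optional type such as float.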
The chalk.functions module exposes several helpful functions that can be used in combination with underscore references to transform features:
ends_with
levenshtein_distance
like
lower
regexp_extract
regexp_extract_all
regexp_like
regexp_replace
replace
reverse
split_part
starts_with
substr
strpos
upper
trim
bytes_to_string
gunzip
json_extract_array
json_value
md5
sha1
sha256
sha512
spooky_hash_v2_32
spooky_hash_v2_64
string_to_bytes
day_of_month
day_of_week
day_of_year
format_datetime
from_unix_milliseconds
from_unix_seconds
hour_of_day
is_us_federal_holiday
month_of_year
to_iso8601
total_seconds
unix_milliseconds
unix_seconds
week_of_year
array_agg
array_count_value
array_distinct
array_join
array_max
array_min
array_sort
cardinality
head
element_at
max
max_by
min
min_by
slice
contains