Features
Using underscore expressions to define features
Underscore expressions are used to derive features from operations on other features. In feature definitions, _
represents a reference to the containing feature class and is used to access other features in the same instance. In
addition to arithmetic operations, underscore expressions can also be used to filter and aggregate data.
Underscore expressions are useful because Chalk statically analyzes and optimizes them, leading to better performance when compared to equivalent Python resolvers. They are also useful for succinctly defining common features, such as the number of times a user failed a login attempt over the past 30 days or the total amount spent at a given merchant.
In this example, we have a Transaction feature class with total
and sales_tax
features and we want to define
subtotal
as total
minus sales_tax
. Instead of writing a Python resolver, we can resolve subtotal
with an
underscore expression:
from chalk import online
from chalk.features import _, features
@features
class Transaction:
id: int
total: float
sales_tax: float
subtotal: float = _.total - _.sales_tax
@online
def get_subtotal(total: Transaction.total, sales_tax: Transaction.sales_tax) -> Transaction.subtotal:
return total - sales_tax
DataFrame features can be filtered with underscore expressions.
Extending our Transaction
example, we can create a User
feature class with a has-many relationship
to Transaction
. Then, we can define a feature representing the number of large purchases by referencing the existing
User.transactions
feature:
from chalk.features import _, features, DataFrame
@features
class Transaction:
id: int
user_id: "User.id"
total: float
sales_tax: float
subtotal: float = _.total - _.sales_tax
@features
class User:
id: int
# implicit has-many relationship with Transaction due to `user_id` above
transactions: DataFrame[Transaction]
num_large_transactions: int = _.transactions[_.total > 1000].count()
The object referenced by _
changes depending on its current scope. In this code, the _
in _.transactions
references the User
object. Within the DataFrame filter, the _
in _.total
references each Transaction
object as
each one is evaluated. The count
aggregation is covered in the next section.
DataFrame features support projection with underscore expressions, which produce a new DataFrame scoped down to the referenced columns. DataFrames can be aggregated after eligible columns are selected.
With our Transaction
example, we already saw a count
aggregation for counting the number of large transactions. We
can add another aggregation for computing the user’s total spend:
from chalk.features import _, features, DataFrame
@features
class Transaction:
id: int
user_id: "User.id"
total: float
sales_tax: float
subtotal: float = _.total - _.sales_tax
@features
class User:
id: int
# implicit has-many relationship with Transaction due to `user_id` above
transactions: DataFrame[Transaction]
num_large_transactions: int = _.transactions[_.total > 1000].count()
total_spend: float = _.transactions[_.total].sum()
To compute User.total_spend
, we needed to create a projection of the User.transactions
DataFrame limited to only the
Transaction.total
column so that the sum
aggregation could work. In contrast, no projection was needed for
num_large_transaction
’s count
aggregation because count
works on DataFrames with any number of columns.
+
-
*
/
>
>=
<
<=
==
!=
&
|
Do not use Python’s and
, or
, not
, or is
operators in underscore expressions. Python does not allow these
operators to be overridden, so they will not work with Chalk’s underscore expressions.
Aggregation functions have varying behavior when handling None values and empty DataFrames. If an aggregation function says None values are skipped in the table below, it will consider a DataFrame with only None values as empty.
Function | None values | Empty DataFrame | Notes |
---|---|---|---|
sum | Skipped | Returns 0 | |
min | Skipped | Returns None | |
max | Skipped | Returns None | |
mean | Skipped | Returns None | Feature type must be float or float | None . None values are skipped, meaning they are not included in the mean calculation. |
count | Included | Returns 0 | |
any | Skipped | Returns False | |
all | Skipped | Returns True | |
std | Skipped | See notes | Standard deviation. Requires at least 2 values. For DataFrames with less than 2 values, returns None .Aliases: stddev , stddev_sample , std_sample . |
var | Skipped | See notes | Variance. Same requirements as std .Alias: var_sample . |
approx_count_distinct | Skipped | Returns 0 | |
approx_percentile | Skipped | Returns None | Takes one argument, quantile , expected to be a float in the range [0, 1] .Example: approx_percentile(0.75) returns a value approximately equal to the 75th percentile of the not-None values in the DataFrame. |
For aggregations that can return None
, either mark the feature as optional (for example, by setting the feature type
to float | None
) or use coalesce
to fall back to a default value.
The chalk.functions
module exposes several helpful functions that can be used in
combination with underscore references to transform features:
ends_with
levenshtein_distance
like
lower
regexp_extract
regexp_extract_all
regexp_like
replace
reverse
split_part
starts_with
substr
upper
trim
day_of_month
day_of_week
day_of_year
hour_of_day
month_of_year
total_seconds
unix_milliseconds
unix_seconds
week_of_year