Defining the features that we want to compute.
If you want to skip ahead, you can find the full source code for this tutorial on GitHub.
In this example, we’ll consider a fintech use-case where we want to detect fraudulent credit card purchases. Our data consists of a list of credit card transactions, each with a timestamp, a location, and a purchase amount. We also have information about the cardholder and the accounts that the card is linked to.
We’ll start by modeling the features we want for the users in our system.
We’ll start simple with three scalar features: user.id, user.name, and user.email.
First, we’ll create a new file called models.py where we’ll define a User class decorated with @features.
from chalk.features import features

@features
class User:
    id: int
    # The name the user provided to us at signup.
    # :owner: identity@chalk.ai
    # :tags: pii
    name: str
    # :tags: pii
    email: str
Note that at this point, we haven’t defined how to compute these features. We are only thinking about the data that we would like to have.
There are a few things to note here. First, every feature class needs a unique id field. By default, this is the field named id. However, if you want to use a different field as the primary key, you can mark it by annotating that field with the Primary type.
from chalk.features import features, Primary

@features
class User:
    # user_id replaces the default id field as the primary key.
    user_id: Primary[int]
    name: str
    email: str
In our features above, we’ve added some comments and annotations. These are optional, but can be useful for documentation and for setting alerting policies. For example, you may wish to send PagerDuty alerts to different teams based on the owner of the related feature.
Any of the comments and tags from the code also show up in the Chalk dashboard, and are indexed for search.
For example, we’ve added a pii tag to the name and email fields. This means that these fields will be treated as personally identifiable information and will be subject to additional restrictions.
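To make the idea of comment-based metadata concrete, here is a minimal sketch of how `# :key: value` comments could be attached to the field that follows them. This is a hypothetical parser written for illustration only; the names `collect_metadata`, `ANNOTATION`, and `FIELD` are assumptions and this is not how Chalk itself reads annotations.

```python
import re

# Source text of a feature class, copied from the User example.
SOURCE = '''
class User:
    id: int
    # The name the user provided to us at signup.
    # :owner: identity@chalk.ai
    # :tags: pii
    name: str
    # :tags: pii
    email: str
'''

# Hypothetical patterns: "# :key: value" comments and "name: type" fields.
ANNOTATION = re.compile(r"#\s*:(\w+):\s*(.+)")
FIELD = re.compile(r"(\w+)\s*:\s*\w+")


def collect_metadata(source: str) -> dict[str, dict[str, str]]:
    """Attach each `# :key: value` comment to the field declared after it."""
    metadata: dict[str, dict[str, str]] = {}
    pending: dict[str, str] = {}
    for raw in source.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            match = ANNOTATION.match(line)
            if match:
                pending[match.group(1)] = match.group(2).strip()
            continue
        match = FIELD.fullmatch(line)
        if match:
            metadata[match.group(1)] = pending
        # Any non-comment line ends the current run of annotations.
        pending = {}
    return metadata


metadata = collect_metadata(SOURCE)
pii_fields = [name for name, meta in metadata.items() if meta.get("tags") == "pii"]
print(pii_fields)
```

A lookup like `pii_fields` is the kind of query that indexed tags make possible, for instance when deciding which fields need extra access restrictions.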
Next up, we’ll define a feature class related to our users. We’ll call this class Account, and it will represent a bank account that a user owns.
from chalk.features import features

@features
class Account:
    id: int
    # The name of the owner of the account.
    title: str
    # The id of the user that owns this account.
    user_id: int
    # The balance of the account, in dollars.
    balance: float
This should look much like what we did for the User class. However, we may want to link these two classes together. We can do this by adding a user field to the Account class.
@features
class Account:
    id: int
    # Join condition: this account's user_id equals User.id.
    user_id: User.id
    balance: float
    # The user that owns this account.
    user: User
This denotes that each account has one user, and that Account.user_id and User.id are equal. Because Account.user_id is annotated as User.id, it shares that feature’s type, int.
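As a rough mental model of what this has-one join expresses, the sketch below resolves an account’s owner by matching Account.user_id against User.id using plain dataclasses. It is not Chalk code: `resolve_user` is a hypothetical helper and the sample rows are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: int
    name: str


@dataclass
class Account:
    id: int
    user_id: int
    balance: float


def resolve_user(account: Account, users: list[User]) -> Optional[User]:
    """Return the single user whose id equals account.user_id, or None."""
    users_by_id = {user.id: user for user in users}
    return users_by_id.get(account.user_id)


# Illustrative rows only.
users = [User(id=1, name="Ada"), User(id=2, name="Grace")]
account = Account(id=10, user_id=2, balance=125.0)

owner = resolve_user(account, users)
print(owner)
```

Declaring the join on the feature class means this lookup does not have to be written by hand wherever an account’s user is needed.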
Once we’ve defined the relationship on one side of the join, we can define the inverse relationship on the other side without needing to specify the predicate again.
@features
class User:
    id: int
    name: str
    email: str
    # The account that this user owns.
    account: "Account"
The final feature entity that we’ll define in this tutorial is for transactions. Each account has many transactions, and each transaction is linked to a single account. We’ll define the Transaction class and link it to our Account class as follows:
from enum import Enum

from chalk.features import features, DataFrame

class TransactionStatus(str, Enum):
    PENDING = "pending"
    CLEARED = "cleared"
    FAILED = "failed"

@features
class Transaction:
    id: int
    # The id of the account that this transaction belongs to, set to a join.
    account_id: "Account.id"
    # The amount of the transaction, in dollars.
    amount: float
    # The status of the transaction, defined as an enum above.
    status: TransactionStatus
    # Because `account_id` above already defines the join
    # condition between `Transaction` and `Account`, we
    # don't need to repeat it here.
    account: "Account"

@features
class User:
    id: int
    name: str
    email: str
    # The account that this user owns.
    account: "Account"

@features
class Account:
    id: int
    user_id: User.id
    balance: float
    user: User
    # All of the transactions that belong to this account.
    transactions: DataFrame[Transaction]
This is the first time we’re seeing the DataFrame type. A Chalk DataFrame models tabular data in much the same way that pandas does. However, there are some key differences that allow the Chalk DataFrame to increase type safety and performance.
Like pandas, Chalk’s DataFrame is a two-dimensional data structure with rows and columns. You can perform operations like filtering, grouping, and aggregating on a DataFrame. However, there are two main differences. First, a Chalk DataFrame is lazy and can be backed by multiple data sources, whereas a pandas.DataFrame executes eagerly in memory. Second, DataFrame[...] can be used to represent a type of data with pre-defined filters.
You can read more about the Chalk DataFrame in the docs and API Reference.
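To illustrate the has-many relationship and the lazy filter-then-aggregate pattern, here is a plain-Python sketch that uses a generator in place of a DataFrame. It deliberately avoids the Chalk DataFrame API: `transactions_for` is a hypothetical helper and the rows are invented sample data.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Transaction:
    id: int
    account_id: int
    amount: float
    status: str


def transactions_for(account_id: int, rows: Iterable[Transaction]) -> Iterator[Transaction]:
    """Lazily yield the rows that belong to one account."""
    return (t for t in rows if t.account_id == account_id)


# Illustrative rows only.
rows = [
    Transaction(id=1, account_id=10, amount=25.0, status="cleared"),
    Transaction(id=2, account_id=10, amount=40.0, status="pending"),
    Transaction(id=3, account_id=11, amount=15.5, status="cleared"),
    Transaction(id=4, account_id=10, amount=10.0, status="cleared"),
]

# Building the generator does no work; rows are only scanned when summed.
account_rows = transactions_for(10, rows)
cleared_total = sum(t.amount for t in account_rows if t.status == "cleared")
print(cleared_total)
```

The generator stands in for laziness only in a loose sense: no rows are examined until the aggregate is requested, which is the property that lets a lazy structure push filters down to its data source.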
You might also notice that we’ve used an Enum feature here. Chalk supports many feature types, including Enum, lists and sets, and dataclasses.
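The TransactionStatus enum mixes in str, which is standard-library behavior worth seeing in isolation: members compare equal to their raw string values, and raw strings can be converted back into members. The sketch below uses only Python’s enum module; `parse_status` is a hypothetical helper, not part of Chalk.

```python
from enum import Enum
from typing import Optional


class TransactionStatus(str, Enum):
    PENDING = "pending"
    CLEARED = "cleared"
    FAILED = "failed"


def parse_status(raw: str) -> Optional[TransactionStatus]:
    """Convert a raw string into a TransactionStatus, or None if unknown."""
    try:
        return TransactionStatus(raw)
    except ValueError:
        return None


status = parse_status("cleared")
print(status is TransactionStatus.CLEARED)
print(TransactionStatus.FAILED == "failed")
print(parse_status("refunded"))
```

Because the members are also strings, values arriving from a database or JSON payload can be compared against the enum directly, while unknown values are rejected at the conversion step.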