Tutorial: Building a fraud pipeline

Introduction

Chalk helps you build out feature pipelines for training and serving machine learning models.

The building blocks of Chalk are features and resolvers. Features specify what you want your data to look like, and resolvers tell Chalk where to find that data.

In Chalk, features are defined as Python classes with annotated attributes. Note how the User class is annotated with the @features decorator that’s imported from chalk.features.

from chalk.features import features
import datetime as dt

@features
class User:
    id: int
    birthday: dt.datetime
    name: str
    email: str

    # computed features
    username: str
    age: int
    is_adult: bool

Features are the abstract underlying attributes of your data. In the example above `id`, `birthday`, `name`, `email`, `username`, `age`, and `is_adult` are all features. Features are grouped together (namespaced) in what are called feature classes. For instance, the `User` class is a **feature class** (also sometimes referred to as a **feature set**). Specific instances of a feature class are **feature class instances**.

Note, you define your features before defining how they will be computed. Some of these features might come directly from a database table: for instance, the id, birthday, and name fields that we’ve defined above. Some might be calculated in real time based on the current timestamp (like the age feature) or a different upstream feature (like the is_adult feature, which will depend on the age feature).
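
To make the dependency chain concrete, here is a minimal plain-Python sketch of how age and is_adult could be derived. It does not use Chalk at all. The function names and the threshold of 18 are assumptions made for illustration, since the tutorial does not specify them.

```python
import datetime as dt

ADULT_AGE = 18  # assumed threshold; the tutorial does not specify one


def compute_age(birthday: dt.datetime, now: dt.datetime) -> int:
    """Whole years elapsed between `birthday` and `now`."""
    years = now.year - birthday.year
    # Subtract one year if this year's birthday has not happened yet.
    if (now.month, now.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def compute_is_adult(age: int) -> bool:
    """Derived purely from the upstream `age` value."""
    return age >= ADULT_AGE
```

Note that compute_is_adult never looks at the birthday: it only depends on age, which mirrors how one feature can be computed from another upstream feature.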

You could deploy the code above to Chalk and it would be considered valid. However, if you tried to query for a field (for example, if you asked Chalk to return the name of a user with id 1) then you’d get an error.

In particular, you would get the following error:

$ chalk query --in user.id=1 --out user.name
Results

No scalar features


Errors

Resolver Not Found user.name

Failed to find any valid resolver for feature: 'user.name' and no `default` value or
`max_staleness` was specified for 'user.name', so the feature cannot be defaulted or
resolved from the online store.

We'll talk more about deploying code to Chalk later, but, if you'd like to, you're welcome to skip ahead to get a better sense of [deploying code to Chalk] and [running queries].

This should not be surprising. Chalk has no connection to your underlying data. So it doesn’t know how to get the name of the user with an id of 1 (or any id for that matter).

In particular, the error message tells us that a “Resolver” was not found. A resolver is a function that takes in features and outputs features. Broadly, there are two types of resolvers: root and non-root. Root resolvers don’t take any features (or take only a primary key) as their input and are used to fetch data from external sources.

In the User example we sketched out above, you could think of a root resolver as a SQL query that reads from a users table in an external database.

While there are exceptions, this is what the majority of root resolvers look like. For example, to read from the users table in your PostgreSQL database, you might write a get_users.chalk.sql file, with the following contents:

-- source: postgres
-- resolves: User

select id, birthday, email, name from users

By writing this file, you’ve defined a SQL resolver, named get_users, which knows how to resolve birthday, name, or email for any given id in the users table. At this point, you might have noticed that the SQL query above doesn’t explicitly know how to get the birthday or name of a given id, it just returns all users. Chalk gets around this by pushing filters into your SQL queries.

Let’s say you’ve defined the SQL resolver above and deployed it to Chalk. You rerun the query that failed above:

$ chalk query --in user.id=1 --out user.name

Chalk will look at the resolvers you’ve defined and ask “given a User.id, can I calculate a User.name?” It will see the SQL resolver defined above and realize that it can be used to get User.name from a User.id. It will push the id filter into the SQL query, running:

select name from users
where id=1

In addition to the filter, a projection will be pushed into your SQL query: only the name column is requested. When running a query, Chalk processes the minimum amount of data required to calculate the desired output features.
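
The effect of pushing a filter and a projection into a query can be sketched with Python’s built-in sqlite3 module. This is only an illustration: the push_down helper is hypothetical, the sample rows are made up, and Chalk’s actual query rewriting is internal and may look quite different.

```python
import sqlite3

BASE_QUERY = "select id, birthday, email, name from users"


def push_down(base_query: str, columns: list[str], filters: dict[str, object]):
    """Wrap `base_query` so only `columns` and matching rows are returned.

    Returns the rewritten SQL and its bound parameters. Column names are
    assumed to be trusted identifiers taken from feature definitions.
    """
    projection = ", ".join(columns)
    sql = f"select {projection} from ({base_query}) as q"
    if filters:
        predicates = " and ".join(f"{col} = ?" for col in filters)
        sql += f" where {predicates}"
    return sql, list(filters.values())


# A throwaway in-memory table standing in for the real users table.
conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer, birthday text, email text, name text)")
conn.executemany(
    "insert into users values (?, ?, ?, ?)",
    [
        (1, "1990-01-01", "john@doe.com", "John Doe"),
        (2, "1985-05-05", "jane@doe.com", "Jane Doe"),
    ],
)

sql, params = push_down(BASE_QUERY, ["name"], {"id": 1})
rows = conn.execute(sql, params).fetchall()
```

The resolver’s query text stays unchanged; only the wrapper decides which rows and columns are actually read.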

Getting this to work fully will involve setting up a database connector to Chalk and deploying code, both of which we will cover later. For now, it is important to note that the process of defining resolvers is incredibly modular.

Now let’s look into defining a non-root resolver, which computes a User’s email username.

In Chalk, non-root resolvers take one of two forms:

Python functions with feature annotations,
inline expressions.

We will look at both variants, but functionally they are similar. Both take input features and generate output features.

from chalk import online
from chalk.features import features
import datetime as dt

@features
class User:
    id: int
    birthday: dt.datetime
    name: str
    email: str

    # computed features
    username: str
    age: int
    is_adult: bool

@online
def get_email_username(email: User.email) -> User.username:
    return email.split('@')[0]

As mentioned above, this could also be written as an inline expression, like so:

import chalk.functions as F
from chalk.features import _, features
import datetime as dt

@features
class User:
    id: int
    birthday: dt.datetime
    name: str
    email: str

    # computed features
    username: str = F.split_part(_.email, delimiter="@", index=0)
    age: int
    is_adult: bool

The advantage of inline expressions is that they are statically compiled into the execution engine, which often provides performance advantages. Chalk also tries to statically compile Python resolvers by parsing the AST of the function!

By writing the get_email_username Python function, we’ve defined another resolver. If we deploy this new code to Chalk, we can now query for the username of a user with a given id.

$ chalk query --in user.id=1 --out user.username

When running the above query, Chalk will identify that it knows how to determine an email from a given id using the get_users resolver. It will also determine that it knows how to calculate a username for an email using the get_email_username resolver. Chalk will execute both these resolvers and return the result.

The focus on data instead of pipelines may be unfamiliar at first. Traditional orchestration platforms like Airflow or Dagster explicitly compose functions which produce data into a DAG of tasks. With Chalk, the DAG of resolvers is defined implicitly by the features they produce. This architecture makes it easy to build out feature pipelines that are reusable and composable. Chalk handles tracking your features for temporal consistency, running your resolvers in parallel, and horizontally scaling your feature pipelines.

Alright, enough with the intro. Let’s get started building!

Fraud Detection Pipeline

This tutorial walks through the process of building a feature pipeline for fraud detection. We’ll cover the full feature development lifecycle:

Data Modeling - Creating feature classes for the data we want to compute.
SQL Resolvers - Mapping data from SQL sources to feature classes.
Python Resolvers - Defining resolvers in Python that compute derived features and call external APIs.
Inference - Integrating Chalk into production decisioning systems.
Backtesting - Experimenting with new features.

In this tutorial, we’ll assume the following existing data architecture:

a PostgreSQL database with three tables: users, accounts, and transactions.
a Snowflake analytics database with the same three tables: users, accounts, and transactions, which is periodically updated in response to changes in the PostgreSQL database.

Before you get started, make sure you have the Chalk CLI installed.

If you want to skip ahead, you can find the full source code for this tutorial on GitHub.

Define features

We’ll start by modeling out the User feature class.

We’ll start simple with three scalar features: User.id, User.name, and User.email. First, we’ll create a new file called feature_sets.py where we’ll define a User class decorated with @features.

src/feature_sets.py

from chalk.features import features

@features
class User:
    id: int

    # The name the user provided to us at signup.
    # :owner: identity@chalk.ai
    # :tags: pii
    name: str

    # :tags: pii
    email: str

Note that at this point we haven’t defined how to compute these features. We are only thinking about the data that we would like to have.

Primary keys

There are a few things to note here. First, all our feature classes need to have a unique id field. By default, this is the field named id. However, if you want to use a different feature name as the primary key, you can specify it by annotating the primary key feature with the Primary type.

src/feature_sets.py

from chalk.features import features, Primary

@features
class User:
  user_id: Primary[int]
  name: str
  email: str

Tags, Descriptions, and Owners

In our features above, we’ve added some comments and annotations. These are optional, but can be useful for documentation and for setting alerting policies. For example, you may wish to send PagerDuty alerts to different teams based on the owner of the related feature.

All of the comments and tags from the code also show up in the Chalk dashboard, and are indexed for search.

For example, we’ve added a pii tag to the name and email fields. This means that these fields will be treated as personally identifiable information and will be subject to additional restrictions.

Has-One Relationships

Next up, we’ll define a feature class related to our User. We’ll call this class Account and it will represent a user’s bank account.

src/feature_sets.py

from chalk.features import features

@features
class Account:
    id: int

    # The name of the owner of the account.
    title: str

    # The id of the user that owns this account.
    user_id: int

    # The balance of the account, in dollars.
    balance: float

This should look much like what we did for the User class. However, we may want to link these two classes together. We can do this by adding a user field to the Account class.

src/feature_sets.py

@features
class Account:
  id: int
  user_id: User.id
  balance: float

  # The user that owns this account.
  user: User

This denotes that each account has one user, and that the Account.user_id and the User.id are equal and of type int, as described by User.id.

Once we’ve defined the relationship on one side of the join, we can define the inverse relationship on the other side without needing to specify the foreign key again.

src/feature_sets.py

@features
class User:
  id: int
  name: str
  email: str

  # The account that this user owns.
  account: "Account"

Note that the account annotation in the User feature class is in quotes because the Account class is defined later in the file. Chalk recognizes the quoted name as a valid feature reference and processes it correctly.

Has-Many Relationships

The final feature entity that we’ll define in this tutorial is for transactions. Each account has many transactions and each transaction is linked to a single account. We’ll define the Transaction class and link it to our Account class as follows:

src/feature_sets.py

from enum import Enum
from chalk.features import features, DataFrame, FeatureTime

class TransactionStatus(str, Enum):
    PENDING = "pending"
    CLEARED = "cleared"
    FAILED = "failed"

@features
class Transaction:
   id: int

   # The id of the account that this transaction belongs to, set to a join.
   # We refer to features and feature classes defined further down in the file
   # using quotation marks, so Chalk will recognize that it is a valid
   # feature reference to be processed later.
   account_id: "Account.id"

   # The amount of the transaction, in dollars.
   amount: float

   # The status of the transaction, defined as an enum above.
   status: TransactionStatus

   # When the transaction occurred
   created_at: FeatureTime

   # Because the join condition between `Transaction`
   # and `Account` is already defined on `account_id`
   # above, we don't need to repeat it here.
   account: "Account"

@features
class User:
  id: int
  name: str
  email: str

  # The account that this user owns.
  account: "Account"

@features
class Account:
  id: int
  user_id: User.id
  balance: float
  user: User
  transactions: DataFrame[Transaction]

This is the first time we’re seeing the DataFrame type.

A Chalk DataFrame models tabular data in much the same way that pandas does. However, there are some key differences that allow the Chalk DataFrame to increase type safety and performance.

Like pandas, Chalk’s DataFrame is a two-dimensional data structure with rows and columns. You can perform operations like filtering, grouping, and aggregating on a DataFrame. However, there are two main differences.

Lazy implementation - Chalk’s DataFrame is lazy and can be backed by multiple data sources, whereas a pandas.DataFrame executes eagerly in memory.
Usable as a type - Chalk’s DataFrame[...] can be used to represent a type of data with pre-defined filters.
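
The difference between lazy and eager evaluation can be illustrated with a toy class. LazyRows below is hypothetical and far simpler than Chalk’s DataFrame; it only shows that building up filters does not touch the data source until the result is collected.

```python
class LazyRows:
    """Records filters and only applies them when `collect()` is called."""

    def __init__(self, load_rows):
        self._load_rows = load_rows  # callable that fetches rows on demand
        self._predicates = []

    def filter(self, predicate):
        new = LazyRows(self._load_rows)
        new._predicates = [*self._predicates, predicate]
        return new  # nothing has been loaded yet

    def collect(self):
        return [
            row
            for row in self._load_rows()
            if all(predicate(row) for predicate in self._predicates)
        ]


calls = []


def load_transactions():
    calls.append("load")  # track when the source is actually read
    return [{"id": 1, "amount": 12.0}, {"id": 2, "amount": 93.27}]


large = LazyRows(load_transactions).filter(lambda row: row["amount"] > 50)
loads_before_collect = len(calls)  # building the query read nothing
result = large.collect()
```

An eager structure would have called load_transactions as soon as it was constructed. Here the source is read exactly once, when collect() runs.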

You might also notice that we used an Enum feature. Chalk supports many feature types, including Enums, lists/sets, and dataclasses.

We also added a created_at field to the Transaction class of type FeatureTime. FeatureTime is a special annotation that can only be assigned to a single feature in a feature class. This feature specifies the logical time of a feature instance—it is used for point in time correctness and incremental updating. We’ll cover both later in this tutorial.
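
As a rough intuition for point-in-time correctness, consider looking up the value a feature had at some moment in the past. The sketch below is plain Python with made-up balances, and value_as_of is a hypothetical helper rather than a Chalk API. It returns the latest observation at or before the requested time, so later data never leaks into an earlier view.

```python
import bisect
import datetime as dt


def value_as_of(observations, as_of):
    """Return the latest observed value with timestamp <= `as_of`, else None.

    `observations` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [timestamp for timestamp, _ in observations]
    index = bisect.bisect_right(timestamps, as_of)
    return observations[index - 1][1] if index else None


balances = [
    (dt.datetime(2023, 1, 1), 100.0),
    (dt.datetime(2023, 2, 1), 250.0),
    (dt.datetime(2023, 3, 1), 40.0),
]
```

For example, asking for the balance as of mid-February yields the February observation, even though a newer March value exists.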

Configuring SQL sources

The primary source of data for most companies is SQL databases. (By SQL database we don’t mean only relational databases: we mean any database that can be queried with SQL, which in practice covers nearly all of them.) Chalk can automatically ingest data from SQL databases and map results into feature classes.

In our example application, we have two databases: PostgreSQL and Snowflake. Our PostgreSQL database is the primary database used in our codebase, and our Snowflake database is used for analytics, with tables populated from DBT views and batch jobs.

To configure your SQL sources in Chalk, we’ll (1) add your data sources to the Chalk dashboard and (2) define the data sources in your Chalk code.

Adding Data Sources in the Chalk Dashboard

To add your data sources to the Chalk dashboard, go to the Data Sources tab of the dashboard and click on the Add a data source button. You can then select the type of data source you want to add.

Select PostgreSQL and fill in the required information. Repeat the same process with a Snowflake data source. Keep track of the names that you’ve given to each of your data sources: these will be used in the next step. We recommend naming your PostgreSQL data source postgres and your Snowflake data source snowflake.

Note, the data source connection information you add in the dashboard will be stored securely using the secrets manager of your cloud provider.

Define Your Data Sources in Your Chalk Code

To define your data sources, you’ll need to create a datasources.py file that contains a SnowflakeSource and a PostgreSQLSource:

src/datasources.py

from chalk.sql import SnowflakeSource, PostgreSQLSource

# If you named your PostgreSQL or Snowflake data sources something other than
# `postgres` and `snowflake`, you'll have to update the code accordingly.
snowflake = SnowflakeSource(name="snowflake")
postgres = PostgreSQLSource(name="postgres")

Now that we’ve defined our data sources, we can use them to create root SQL resolvers.

Online data

Chalk’s preferred way to ingest data from SQL databases is to use SQL file resolvers.

With SQL file resolvers, you can write your root resolvers in SQL, and use whatever database tooling you’re familiar with to test, lint, and debug your code.

To create a SQL resolver, add a file to your project directory with the extension .chalk.sql. You can now write a SQL query in this file. You will also need to add metadata to the top of the file to tell Chalk how to ingest the data.

For example, say that we want to resolve the name and email features of our User feature class from a PostgreSQL table.

To do this, we can write the following SQL file resolver:

src/get_user.chalk.sql

-- get users from postgres.
-- resolves: User
-- source: postgres
select
    id,
    full_name as name,
    email
from users

The resolves key (-- resolves: User) tells Chalk which feature class the columns in the select statement should be mapped into. The target names of the query are then compared against the names of the features on the specified feature class. If the names match after stripping underscores and lower-casing, the select target is mapped to the feature.

In the example above, we aliased the full_name column to name, so it will be mapped to the name attribute on the User feature class.
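
The matching rule described above (strip underscores, then lower-case) can be sketched in plain Python. The normalize and map_columns helpers are hypothetical and only approximate the rule as stated here; Chalk’s real matching logic may handle more cases.

```python
def normalize(name: str) -> str:
    """Strip underscores and lower-case, as the matching rule describes."""
    return name.replace("_", "").lower()


def map_columns(columns: list[str], feature_names: list[str]) -> dict[str, str]:
    """Map each select target to a feature whose normalized name matches."""
    by_normalized = {normalize(feature): feature for feature in feature_names}
    return {
        column: by_normalized[normalize(column)]
        for column in columns
        if normalize(column) in by_normalized
    }


mapping = map_columns(
    ["id", "name", "E_Mail", "full_name"],
    ["id", "name", "email"],
)
```

In this sketch, E_Mail still maps to email, while an unaliased full_name column finds no matching feature, which is why the resolver aliases it to name.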

Chalk validates your SQL file resolvers when you run chalk apply. If you want to validate your code without deploying, you can run the chalk lint command.

The source key (-- source: postgres) tells Chalk which integration to use to connect to the database.

Other comments in the SQL file resolver are indexed by Chalk and can be searched in the Chalk dashboard. In the example resolver above, the comment get users from postgres becomes the description of the resolver in the dashboard.

Let’s also add our get_accounts and get_transactions SQL file resolvers:

src/get_accounts.chalk.sql

-- get accounts from postgres.
-- resolves: Account
-- source: postgres
select
    id,
    user_id,
    balance
from accounts

src/get_transactions.chalk.sql

-- get transactions from postgres.
-- resolves: Transaction
-- source: postgres
select
    id,
    account_id,
    amount,
    status,
    created_at
from transactions

Deploying!

Now that we’ve written a few resolvers, we can deploy our feature pipeline and query our data in realtime. To deploy, you’ll want to run chalk apply. You will be shown the changes that you’ve made since your last deployment and prompted for confirmation:

chalk apply
✓ Found resolvers
✓ Successfully validated features and resolvers!
✓ Checked against live resolvers
Added Features

    Name
───────────────────────────
 +  transaction.id
 +  transaction.account_id
 +  transaction.amount
 +  transaction.status
 +  transaction.created_at
 +  transaction.account
 +  user.id
 +  user.name
 +  user.email
 +  user.account
 +  account.id
 +  account.user_id
 +  account.balance
 +  account.user
 +  account.transactions


  Would you like to deploy?  [y/n]

If you accept, Chalk will build and deploy your new code. Once that’s done, you can start querying your data in realtime!

Querying

Now that we’ve deployed our feature pipeline, we can query our data in realtime. One of the easiest ways to do this is from the Chalk CLI.

$ chalk query --in user.id=1 --out user.name --out user.email

user.name     "John Doe"
user.email    "john@doe.com"

This query will fetch the name and email attributes from the User feature class for the user with id=1, hitting the PostgreSQL database directly.

Push-down filters

Note that in the SQL file resolver that we wrote, we didn’t include a where clause. However, Chalk automatically pushes down filters to the database when querying features. The SQL query that executes against our PostgreSQL database is:

select
  id,
  full_name as name,
  email
from users
where id = 1;

Chalk can also push down non-primary key filters to SQL databases. For example, to fetch all transactions for a user, Chalk will modify the get_transactions resolver to include a where clause for the user’s account_id:

select
  id,
  account_id,
  amount,
  status,
  created_at
from transactions
where account_id = 38;

You can see this in action by querying for a user’s transactions:

$ chalk query --in user.id=1 --out user.account.transactions
Results

No scalar features

user.account.transactions

 id      account_id   amount    status     created_at
─────────────────────────────────────────────────────────────────────────────────────
 197524  38           12.00     "cleared"  "2023-12-02T21:05:54.057868+00:00"
 198604  38           27.51     "cleared"  "2023-04-29T22:27:12.058023+00:00"
 210326  38           93.27     "cleared"  "2023-02-17T21:49:27.058144+00:00"
 228363  38           1.05      "cleared"  "2023-04-20T21:46:33.058240+00:00"
 230225  38           23.91     "cleared"  "2023-06-09T03:07:44.058314+00:00"
 235551  38           12.20     "failed"   "2022-12-26T15:45:07.058416+00:00"

Derived Features

We’ve noticed that some fraudsters try to link stolen accounts to our platform and attempt to transfer money through our system. To detect this behavior, we want to compute a similarity score between the user’s name and the account’s title.

We’ll start by adding this new feature, account_name_match, to our User feature class.

src/feature_sets.py

@features
class User:
  id: int
  name: str
  email: str
  account: "Account"

  # The similarity between the user's name and the account's title.
  account_name_match: float

Next, we’ll define a resolver that computes this feature. We’ll use Jaccard similarity to compute the similarity score.

src/resolvers.py

from src.feature_sets import User
from chalk import online

@online
def account_name_match(
    title: User.account.title,
    name: User.name,
) -> User.account_name_match:
    """Docstrings show up in the Chalk dashboard"""
    intersection = set(title) & set(name)
    union = set(title) | set(name)
    return len(intersection) / len(union)
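
To make the arithmetic concrete, here is the same character-set Jaccard computation as a standalone function with no Chalk dependency. The jaccard name and the choice to return 0.0 when both strings are empty are assumptions for this sketch; the resolver above would raise a ZeroDivisionError in that case.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both strings are empty: treat similarity as 0.0 (an assumption).
        return 0.0
    return len(set_a & set_b) / len(union)
```

For "abc" and "bcd" the shared characters are b and c out of four distinct characters, giving 0.5. Because the comparison is over sets of characters, it ignores order and repetition, so "abc" and "cba" score 1.0.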

The @online decorator tells Chalk that this resolver should be called in realtime when the User.account_name_match feature is requested. Our feature dependencies are declared in the function signature as User.account.title and User.name. Chalk will automatically retrieve User.name with our get_user.chalk.sql resolver. Then, by matching Account.user_id against the user’s id, Chalk will retrieve Account.title from the get_accounts.chalk.sql resolver. Let’s deploy our code.

$ chalk apply --branch tutorial
✓ Found resolvers
✓ Deployed branch

Note that this time we are deploying to a branch (by specifying the --branch <name> flag). When testing new features, we recommend deploying your feature pipeline to a branch, which allows you to test your changes without affecting your production feature pipelines. An additional benefit of branches is that they are lightweight: they should deploy almost instantly.

Testing

Resolvers are callable functions, so we can also test them like any other Python function. Let’s test our new resolver by writing a unit test:

This is one of the only times when you'll explicitly call your resolver functions. We covered this idea already in the intro, but since it's one of the central concepts behind Chalk it is worth reiterating. Most of the time, Chalk will determine which resolvers it needs to run to compute the features you've requested in a query. This is just like how a SQL database decides how to compute the output columns you've requested.

tests/test_name_match.py

from src.resolvers import account_name_match

def test_names_match():
    """Resolvers can be unit tested exactly as you would expect.

    Here, the `account_name_match` resolver should return 1.0
    because the `title` and `name` are identical.
    """
    assert 1 == account_name_match(
        title="John Coltrane",
        name="John Coltrane",
    )

def test_names_completely_different():
    """The `account_name_match` resolver should return 0
    because the `title` and `name` don't share any characters.
    """
    assert 0 == account_name_match(
        title="John Coltrane",
        name="Zyx",
    )

If you want to learn more, we also provide additional docs on testing resolvers.

Reverse ETL

Let’s say we have a rather complex feature that we can’t serve in realtime because it increases the latency of a query too much for a particular production use case. For instance, let’s say we want to add a new feature to our users that indicates the average time between their transactions. Maybe we believe that we can use this to detect whether a new transaction is fraudulent.

To add this feature, we’d add it to the User feature class:

src/feature_sets.py

@features
class User:
  id: int
  name: str
  email: str
  account: "Account"

  # The similarity between the user's name and the account's title.
  account_name_match: float

  # The average time between each of a user's transactions, from the last
  # 30 days, in ms.
  average_time_between_transactions_30d: float

We would then write a resolver to calculate this feature:

from chalk import online
from chalk.features import after

from src.feature_sets import Transaction, User

@online
def get_average_time_between_transactions(
    transactions: User.account.transactions[
      Transaction.created_at,
      after(days_ago=30)
    ]
) -> User.average_time_between_transactions_30d:
  """Computes the average time between transactions for a user"""

  # Conversion to Polars is cheap since both Chalk and Polars DataFrames
  # use Arrow as their memory representation.
  df = transactions.to_polars()

  # Get the date column, convert to milliseconds timestamps, and sort
  date_col = df.collect().get_column(
    str(Transaction.created_at)
  ).dt.timestamp("ms").sort()

  return (date_col - date_col.shift(1)).mean()

For good measure, let’s also add this to our tests and make sure that it is running as expected:

tests/test_average_time.py

import datetime as dt

from chalk.features import DataFrame

from src.feature_sets import Transaction
from src.resolvers import get_average_time_between_transactions

def test_get_average_time_between_transactions():
    """To unit test DataFrame-taking resolvers, you can construct and
    pass in a Chalk DataFrame:
    """
    now = dt.datetime.now(tz=dt.timezone.utc)
    transactions = DataFrame({
      Transaction.created_at: [
        # will be filtered out since it did not occur in last 30d
        dt.datetime(1990, 1, 1, tzinfo=dt.timezone.utc),
        now,
        now - dt.timedelta(seconds=1), # diff = 1000
        now - dt.timedelta(seconds=2), # diff = 1000
        now - dt.timedelta(seconds=3), # diff = 1000
        now - dt.timedelta(seconds=4), # diff = 1000
        now - dt.timedelta(seconds=6), # diff = 2000
      ]                                # -----------
    })                                 #        6000 / 5 = 1200

    assert 1200 == get_average_time_between_transactions(
      transactions
    )
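
The expected value of 1200 in the test can be checked by hand with plain Python. The mean_gap_ms helper below is a hypothetical stand-in that uses only the standard library; it mirrors the sort-and-difference logic of the resolver without Polars or Chalk.

```python
import datetime as dt


def mean_gap_ms(timestamps: list[dt.datetime]) -> float:
    """Average gap between consecutive timestamps, in milliseconds."""
    ordered = sorted(timestamps)
    gaps = [
        (later - earlier).total_seconds() * 1000
        for earlier, later in zip(ordered, ordered[1:])
    ]
    if not gaps:
        raise ValueError("need at least two timestamps")
    return sum(gaps) / len(gaps)


# The six recent timestamps from the test, as offsets in seconds before `now`.
now = dt.datetime(2024, 1, 1, tzinfo=dt.timezone.utc)
times = [now - dt.timedelta(seconds=offset) for offset in (0, 1, 2, 3, 4, 6)]
average = mean_gap_ms(times)
```

Six timestamps produce five gaps (four of 1000 ms and one of 2000 ms), so the average is 6000 / 5 = 1200 ms.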

Let’s deploy this new feature to a branch and run a couple test queries:

chalk apply --branch new-fraud-feature

chalk query \
  --in user.id=1 \
  --out user.average_time_between_transactions_30d \
  --branch new-fraud-feature

After some testing, we realize a couple of things: (1) this feature is too slow to compute in real time (for our use case), and (2) this feature doesn’t change very often.

In practice, even executing the above resolver in realtime should be really fast. When building new features, do some testing. You may not even need to reverse-ETL features into the online store to achieve your target latency. Reverse-ETL-ing features into the online store adds state and complexity to your feature pipeline. We recommend only doing this if you are unable to achieve your latency targets.

We decide to reverse-ETL this computed feature from our Snowflake data store into our Chalk online store. At a high level, this means that periodically (maybe once a day), we’ll:

Look for all users with new transactions in our Snowflake data source,
Recompute the average_time_between_transactions_30d for these users,
Load the new average_time_between_transactions_30d values into the Chalk online store.

Note that we don’t want to use our PostgreSQL database for this. Instead, we want to use Snowflake, which, unlike Postgres, is optimized for bulk data loads. As a result, we’ll need to define some offline resolvers.

So far we’ve only written online resolvers. You can think of offline resolvers as overrides for online resolvers that are run in bulk data request scenarios. At a high level, offline resolvers let you pull your feature data from data stores that are optimized for bulk data loading, which is exactly what we want for our reverse-ETL process.

To set up a reverse-ETL process, we’ll need to do three things:

set a staleness policy on your target feature,
add offline resolvers (technically optional, but strongly recommended),
create a scheduled query.

Setting a Max Staleness on Our Feature

We can have Chalk reverse-ETL our offline data into our online store by setting the max_staleness on a feature and using scheduled queries.

src/feature_sets.py

from chalk.features import feature, features

@features
class User:
  id: int
  name: str
  email: str
  account: "Account"

  # The similarity between the user's name and the account's title.
  account_name_match: float

  # The average time between each of a user's transactions, from the last
  # 30 days, in ms.
  average_time_between_transactions_30d: float = feature(max_staleness="infinity")

The max_staleness keyword argument tells Chalk how stale a feature value can get before it should be refreshed. In this case, we’re telling Chalk that we’ll tolerate arbitrarily old feature values. However, we could also specify a max_staleness of 1h or 1d to tell Chalk not to serve feature values that are older than 1 hour or 1 day.

To take advantage of our max_staleness, we need to get computed features into the online store. Features are only written to the online store if they (or their feature class) have been given a max_staleness value.

Features with a max_staleness are written to the online store in two ways:

Passively, when a feature value is computed in response to an online query,
Actively, in response to an ETL or scheduled query.

The first time average_time_between_transactions_30d is queried for any given user it will be computed by the get_average_time_between_transactions resolver. All subsequent times that the feature is requested it will be read from the online store (until the feature is evicted from the cache).
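
The read path described above behaves much like a read-through cache with an age limit. The StaleCache class below is a hypothetical plain-Python sketch of that behavior, not Chalk’s online store: a stored value is served while it is younger than max_staleness, and otherwise the resolver (here, the compute callable) runs again.

```python
import time


class StaleCache:
    """Serve a stored value while it is younger than `max_staleness` seconds."""

    def __init__(self, compute, max_staleness: float, clock=time.monotonic):
        self._compute = compute
        self._max_staleness = max_staleness
        self._clock = clock  # injectable so tests can control time
        self._store = {}  # key -> (stored_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] <= self._max_staleness:
            return hit[1]  # fresh enough: skip the resolver
        value = self._compute(key)  # missing or too stale: recompute
        self._store[key] = (now, value)
        return value
```

Passing float("inf") as max_staleness corresponds to the "infinity" setting: once a value is stored, it is always served from the store.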

Adding Offline Resolvers

By setting up offline SQL resolvers, we are telling Chalk that there are times when we don’t want to pull data from our Postgres data source and instead want to get our data from a bulk data store. There are a couple of reasons why we might want to do this: (1) we don’t want to overwhelm our Postgres data source with requests that don’t require real-time data, and (2) we want to query for large chunks of data (which is the purpose of a database like Snowflake).

In the case of reverse-ETLing data, we have decided that we are tolerating some degree of staleness in our average_time_between_transactions_30d feature in exchange for being able to serve the feature faster. As a result, it makes sense to read the new data from Snowflake.

To connect our feature classes to Snowflake, we’ll write three more SQL resolvers, this time labeling them with type: offline and source: snowflake.

src/get_user_offline.chalk.sql

-- get users from snowflake.
-- resolves: User
-- source: snowflake
-- type: offline
select
    id,
    full_name as name,
    email
from users

src/get_accounts_offline.chalk.sql

-- get accounts from snowflake.
-- resolves: Account
-- source: snowflake
-- type: offline
select
    id,
    user_id,
    balance
from accounts

src/get_transactions_offline.chalk.sql

-- get transactions from snowflake.
-- resolves: Transaction
-- source: snowflake
-- type: offline
select
    id,
    account_id,
    amount,
    status,
    created_at
from transactions

Creating a Scheduled Query

With our offline SQL resolvers defined, we can now write a ScheduledQuery which will periodically run and write updated average_time_between_transactions_30d features to the online store.

from chalk import ScheduledQuery

ScheduledQuery(
    name="user-transaction-reverse-etl-features",
    schedule="0 * * * *",
    output=[User.average_time_between_transactions_30d],
    online=True,
    offline=True,
    incremental_resolvers="get_transactions_offline",
)

Our schedule, “0 * * * *”, means that we will run this query and update the online store every hour, so our features will be at most one hour stale (disregarding any lag in the Snowflake data source). Adjusting the schedule lets us make precise, granular trade-offs between staleness, latency, and compute.

In addition, by specifying an incremental resolver, we are telling Chalk to remember the latest ingest time from the previous run and only pull data that is fresher than that high-water mark. Again, Chalk processes only the minimum amount of data it needs to calculate average_time_between_transactions_30d.
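To make the high-water-mark idea concrete, here is a minimal sketch in plain Python. It is not Chalk's implementation: the `Row` type, its `ingested_at` field, and the `incremental_pull` helper are hypothetical names used only to illustrate the bookkeeping.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Row:
    id: int
    ingested_at: datetime


def incremental_pull(
    rows: List[Row], high_water_mark: Optional[datetime]
) -> Tuple[List[Row], Optional[datetime]]:
    """Return rows newer than the mark, plus the mark to remember for the next run."""
    fresh = [
        r for r in rows
        if high_water_mark is None or r.ingested_at > high_water_mark
    ]
    # If nothing new arrived, keep the previous mark unchanged.
    new_mark = max((r.ingested_at for r in fresh), default=high_water_mark)
    return fresh, new_mark
```

The first run passes `None` as the mark and reads everything. Each later run passes the mark returned by the previous run, so it only reads rows that arrived in between.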

API Calls

Any Python function can be used as a resolver. This means that we can call APIs to compute features. Let’s add a feature that computes the user’s FICO score from our credit scoring vendor, Experian.

As we did earlier, we’ll first add the features that we want to compute:

src/feature_sets.py

from chalk.features import feature, features

@features
class User:
  id: int
  name: str
  email: str
  account_name_match: float

  # The FICO credit score, as provided by a third-party vendor.
  fico_score: int = feature(min=300, max=850, strict=True)

  # Tags from our credit scoring vendor.
  credit_score_tags: list[str]

We are adding strict validation to our fico_score feature to ensure that we only store and utilize valid FICO scores.
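As a rough mental model, strict validation behaves like a range check that rejects a value instead of storing it. The sketch below is plain Python and only assumes that reading of `strict=True`; the `validate_fico` helper is hypothetical and is not part of Chalk.

```python
FICO_MIN = 300
FICO_MAX = 850


def validate_fico(score: int, lo: int = FICO_MIN, hi: int = FICO_MAX) -> int:
    """Return the score unchanged if it lies in [lo, hi]; otherwise raise."""
    if not lo <= score <= hi:
        raise ValueError(f"FICO score {score} is outside the range [{lo}, {hi}]")
    return score
```

Both bounds are inclusive here, which matches the conventional 300 to 850 FICO range.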

Now, we can write a resolver to fetch the user’s FICO score from Experian.

src/resolvers.py

from src.feature_sets import User
from src.mocks import experian
from chalk.features import online, Features

@online
def get_fraud_score(
    name: User.name,
    email: User.email,
) -> Features[User.fico_score, User.credit_score_tags]:
    response = experian.get_credit_score(name, email)

    # We don't need to provide all the features for
    # the `User` class, only the ones that we want to update.
    return User(
        fico_score=response['fico_score'],
        credit_score_tags=response['tags'],
    )

Here, we are returning two features of the user: User.fico_score and User.credit_score_tags. We use the Features type to indicate which features we expect to return. Also note that we are initializing the User class with only the features that we want to update. This partial initialization is the primary difference between Python’s @dataclass and Chalk’s @features.
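For contrast, a standard Python dataclass refuses partial initialization when its fields have no defaults. The sketch below uses only the standard library; `UserRecord` and `try_partial_init` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserRecord:
    id: int
    name: str
    fico_score: int


def try_partial_init() -> str:
    """Attempt to build a UserRecord from one field and report what happens."""
    try:
        UserRecord(fico_score=720)  # `id` and `name` are missing
    except TypeError as exc:
        # A dataclass without defaults requires every field up front.
        return type(exc).__name__
    return "ok"
```

Calling `try_partial_init()` reports a `TypeError`, whereas the resolver above can return a `User` carrying only the two features it computed.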

Setting up a Materialized Windowed Aggregation

When defining aggregations, it is often useful to compute the same aggregation over several time windows by using a windowed aggregation. Over high-cardinality datasets like transactions, it can be more performant to materialize these aggregations in the online store by defining a materialized windowed aggregation.

In a materialized windowed aggregation, you define time buckets, and Chalk pre-aggregates the data in each bucket, storing only the minimum state needed to compute the aggregation for every time window in the feature definition. For example, say we want to compute the count of transactions associated with an Account over the past 7 days, 30 days, and 60 days. Then we can define the materialized windowed aggregation below:

src/feature_sets.py

from chalk.features import DataFrame, _, features
from chalk.streams import Windowed, windowed

@features
class Account:
  id: int
  user_id: User.id
  balance: float
  user: User
  transactions: DataFrame[Transaction]

  num_transactions: Windowed[int] = windowed(
      "7d",
      "30d",
      "60d",
      materialization={
          "bucket_durations": {"1d": ["7d"], "3d": ["30d", "60d"]},
      },
      expression=_.transactions[
          _.created_at > _.chalk_window,
          _.created_at < _.chalk_now
      ].count(),
  )

This defines a materialized windowed aggregation over the Account.transactions DataFrame, which counts the number of transactions associated with an account over the past 7, 30, and 60 days. The materialization argument tells Chalk to pre-aggregate the data in buckets in the online store, such that the 7-day aggregation is pre-aggregated in 1-day buckets, and the 30-day and 60-day aggregations are pre-aggregated in 3-day buckets.

Because .count() only requires the number of transactions that fall within each bucket, we store one count per bucket and, at query time, sum the counts for the buckets that fall within the requested time window. Materializing the aggregation in buckets makes it more performant over high-cardinality datasets, such as transactions, where recomputing the aggregation from raw rows on every query could be computationally expensive.
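The bucketing idea can be sketched in plain Python. This is an illustration of the technique, not Chalk's implementation: `bucket_counts`, `windowed_count`, and the fixed `epoch` used to number buckets are hypothetical, and the sketch counts whole buckets, so results are only exact when the window edges line up with bucket boundaries.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable


def bucket_counts(
    timestamps: Iterable[datetime], bucket: timedelta, epoch: datetime
) -> Counter:
    """Pre-aggregate: keep one count per fixed-width time bucket."""
    counts: Counter = Counter()
    for ts in timestamps:
        counts[(ts - epoch) // bucket] += 1  # integer bucket index
    return counts


def windowed_count(
    counts: Counter,
    bucket: timedelta,
    epoch: datetime,
    now: datetime,
    window: timedelta,
) -> int:
    """Query time: sum the stored counts of buckets overlapping [now - window, now]."""
    first = (now - window - epoch) // bucket
    last = (now - epoch) // bucket
    return sum(counts[i] for i in range(first, last + 1))
```

With 1-day buckets, a 7-day count sums about eight stored integers instead of scanning every transaction. With 3-day buckets, a 60-day count sums about twenty-one.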

Because the materialized windowed aggregation relies on the materialized buckets in the online store, and branch deployments do not persist data to the online store, we would need to run a full deploy with chalk apply to use this new feature. After running a chalk apply, we can run chalk aggregate backfill --feature account.num_transactions to populate the buckets, and then we can query the feature as normal.

In production, you could also set up a schedule to periodically backfill the buckets, or stream the data into the online store, to keep the buckets up to date.

Deploying

Finally, we’ll want to deploy our new resolvers. As we did earlier, we can check our work by using a branch deployment:

$ chalk apply --branch tutorial
✓ Found resolvers
✓ Deployed branch

We can then query our new features:

$ chalk query --branch tutorial  \
              --in     user.id=1 \
              --out    user.account_name_match

CLI Query

Now that we’ve written some features and resolvers and deployed them to Chalk, we’re ready to integrate Chalk into our production decisioning systems.

As a sanity check, it can be helpful to use the Chalk CLI to query a well-known input and ensure that we get the expected output.

We can use the chalk query command, passing in the id of a user, and the names of the features we want to resolve:

$ chalk query --in  user.id=1  \
              --out user.name  \
              --out user.email \
              --out user.account.balance
Results
user.name             "John Doe"
user.email            "john@doe.com"
user.account.balance  2032.91

API Client Query

Once we’re satisfied that our features and resolvers are working as expected, we can use a client library to query Chalk from our application.

In this first example, we’ll use the ChalkClient in the chalkpy package to query Chalk from our application:

datascience/example_inference.py

from src.feature_sets import User
from chalk.client import ChalkClient

# Create a new Chalk client. By default, this will
# pick up the login credentials generated after running
# `chalk login`.
client = ChalkClient()

client.query(
    input=User(id=1234),
    output=[
        User.id,
        User.name,
        User.fico_score,
        User.account.balance,
    ],
)

We use the same feature definitions for querying our data as we used for defining our features and resolvers.

Chalk has API client libraries in several languages, including Python, Go, Java, TypeScript, and Elixir.

Code Generation (Optional)

All API clients can operate on the string names of features. However, in a production system, you may have many hundreds or thousands of features, and want to avoid hard-coding the names of each feature in your code.

To help with this, Chalk can codegen a library of strongly-typed feature names for you.

For example, say the service that calls into Chalk is written in Go. We can generate a Go library of feature names with the following command:

$ chalk codegen go --out ./clients/go/client.go --package=client
✓ Found resolvers
✓    Wrote features to file './clients/go/client.go'
✓    Please do not change the generated code.

This generates a file clients/go/client.go that looks like this:

clients/go/client.go

package client

/**************************************
 Code generated by Chalk. DO NOT EDIT.
 > chalk codegen go --out ./clients/go/client.go --package client
**************************************/

import (
	"github.com/chalk-ai/chalk-go"
	"time"
)

var InitFeaturesErr error

type Account struct {
	Id *int64
	Title *string
	UserId *int64
	Balance *float64
	User *User
	UpdatedAt *time.Time
}

type User struct {
	Id *int64
	Name *string
	Email *string
	Account *Account
	AccountNameMatch *float64
	FicoScore *int64
	CreditScoreTags *[]any
}

var Features struct {
	Account *Account
	User *User
}

func init() {
	InitFeaturesErr = chalk.InitFeatures(&Features)
}

We can then use this library to query Chalk:

import (
    "fmt"
    "log"

    "github.com/chalk-ai/chalk-go"
)

// Create a new Chalk client.
client, err := chalk.NewClient()
if err != nil {
    log.Fatal(err)
}

// Create an empty struct to hold the results.
user := User{}

// Query Chalk, and add the results to the struct.
_, queryErr := client.OnlineQuery(
  chalk.OnlineQueryParams{}.
    WithInput(Features.User.Id, 1234).
    WithOutputs(
      Features.User.Id,
      Features.User.Name,
      Features.User.FicoScore,
      Features.User.Account.Balance,
    ),
  &user,
)
if queryErr != nil {
    log.Fatal(queryErr)
}

// Now, you can access the properties of the
// user for which there was a matching `output`.
// The generated fields are pointers, so dereference to print the value.
fmt.Println(*user.Account.Balance)

If your calling service is written in Python, but you don’t want to take a dependency on the repository that contains your Chalk features, you can generate your Python features into a separate repository:

$ chalk codegen python --out ./clients/python/client.py

You can see the generated code in clients/python/client.py.

If you are generating Python into a subdirectory of your Chalk project, be sure to add an entry to your .chalkignore containing the directory of your generated code (in the above example, clients/). Otherwise, Chalk will find duplicate definitions of your features.
