Build a feature pipeline for fraud detection.
Chalk helps you build out feature pipelines for training and serving machine learning models.
The building blocks of Chalk are features and resolvers. Features specify what you want your data to look like, and resolvers tell Chalk where to find that data.
In Chalk, features are defined as Python classes with annotated attributes.
Note how the User class is annotated with the @features decorator, which is imported from chalk.features.
from chalk.features import features
import datetime as dt


@features
class User:
    id: int
    birthday: dt.datetime
    name: str
    email: str

    # computed features
    username: str
    age: int
    is_adult: bool
Features are the abstract underlying attributes of your data. In the example above `id`, `birthday`, `name`, `email`, `username`, `age`, and `is_adult` are all features. Features are grouped together (namespaced) in what are called feature classes. For instance, the `User` class is a **feature class** (also sometimes referred to as a **feature set**). Specific instances of a feature class are **feature class instances**.
Note, you define your features before defining how they will be computed.
Some of these features might come directly from a database table: for instance, the id, birthday, and name fields that we’ve defined above.
Some might be calculated in real time based on the current timestamp (like the age feature) or on a different upstream feature (like the is_adult feature, which will depend on the age feature).
You could deploy the code above to Chalk and it would be considered valid. However, if you tried to query for a feature (for example, if you asked Chalk to return the name of the user with id 1), you'd get an error.
In particular, you would get the following error:
$ chalk query --in user.id=1 --out user.name
Results
No scalar features
Errors
Resolver Not Found user.name
Failed to find any valid resolver for feature: 'user.name' and no `default` value or
`max_staleness` was specified for 'user.name', so the feature cannot be defaulted or
resolved from the online store.
We'll talk more about deploying code to Chalk later, but, if you'd like to, you're welcome to skip ahead to get a better sense of [deploying code to Chalk] and [running queries].
This should not be surprising. Chalk has no connection to your underlying data. So it doesn’t know how to get the name of the user with an id of 1 (or any id for that matter).
In particular, the error message tells us that a "Resolver" was not found. A resolver is a function that takes in features and outputs features. Broadly, there are two types of resolvers: root and non-root. Root resolvers don't take any features (or take only a primary key) as their input and are used to fetch data from external sources.
In the User example we sketched out above, you could think of a root resolver as a SQL query that reads from a users table in an external database. While there are exceptions, this is what the majority of root resolvers look like.
For example, to read from the users table in your PostgreSQL database, you might write a get_users.chalk.sql file with the following contents:
-- source: PG
-- resolves: User
select id, birthday, email, name from users
By writing this file, you've defined a SQL resolver, named get_users, which knows how to resolve birthday, name, or email for any given id in the users table.
At this point, you might have noticed that the SQL query above doesn't explicitly know how to get the birthday or name for a given id; it just returns all users.
Chalk gets around this by pushing filters into your SQL queries.
Let's say you've defined the SQL resolver above and deployed it to Chalk. Now rerun the query that failed above:
$ chalk query --in user.id=1 --out user.name
Chalk will look at the resolvers you've defined and ask, "given a User.id, can I calculate a User.name?" It will see the SQL resolver defined above and realize that it can be used to get User.name from a User.id. It will push the id filter into the SQL query, running:
select name from users
where id=1
In addition to the filter, a projection will be pushed into your SQL query: only the name column is requested.
When running a query, Chalk processes the minimum amount of data required to calculate the desired output features.
Getting this to work fully will involve setting up a database connection to Chalk and deploying code, both of which we will cover later. For now, it is important to note that the process of defining resolvers is incredibly modular.
Now let's look at defining a non-root resolver, which computes a user's email username.
In Chalk, non-root resolvers take one of two forms: Python functions decorated with a resolver decorator like @online, or inline expressions defined directly on a feature.
We will look at both variants, but functionally they are similar. Both take input features and generate output features.
from chalk import online
from chalk.features import features
import datetime as dt


@features
class User:
    id: int
    birthday: dt.datetime
    name: str
    email: str

    # computed features
    username: str
    age: int
    is_adult: bool


@online
def get_email_username(email: User.email) -> User.username:
    return email.split('@')[0]
As mentioned above, this could also be written as an inline expression, like so:
from chalk.features import features, _
import chalk.functions as F
import datetime as dt


@features
class User:
    id: int
    birthday: dt.datetime
    name: str
    email: str

    # computed features
    username: str = F.split_part(_.email, delimiter="@", index=0)
    age: int
    is_adult: bool
The advantage of inline expressions is that they are statically compiled into the execution engine, which often provides performance advantages. Chalk also tries to statically compile Python resolvers by parsing the AST of the function!
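For instance (a minimal sketch, assuming the age feature is resolved elsewhere), the is_adult feature from the class above could also be written as an inline expression that references another feature through the underscore:

from chalk.features import features, _


@features
class User:
    id: int
    age: int
    # Computed inline from another feature on the same feature class.
    is_adult: bool = _.age >= 18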
By writing the get_email_username Python function, we've defined another resolver. If we deploy this new code to Chalk, we can now query for the username of a user with a given id.
$ chalk query --in user.id=1 --out user.username
When running the above query, Chalk will identify that it knows how to determine an email from a given id using the get_users resolver. It will also determine that it knows how to calculate a username from an email using the get_email_username resolver. Chalk will execute both of these resolvers and return the result.
The focus on data instead of pipelines may be unfamiliar at first. Traditional orchestration platforms like Airflow or Dagster explicitly compose functions which produce data into a DAG of tasks. With Chalk, the DAG of resolvers is defined implicitly by the features they produce. This architecture makes it easy to build out feature pipelines that are reusable and composable. Chalk handles tracking your features for temporal consistency, running your resolvers in parallel, and horizontally scaling your feature pipelines.
Alright, enough with the intro. Let's get started building!
This tutorial walks through the process of building a feature pipeline for fraud detection. We'll cover the full feature development lifecycle, from defining features and resolvers, to deploying and querying them in realtime, to testing, scheduling batch runs, and integrating Chalk with your production services.
In this tutorial, we'll assume the following existing data architecture: a PostgreSQL database with users, accounts, and transactions tables, and a Snowflake warehouse with corresponding users, accounts, and transactions tables, which is periodically updated in response to changes in the PostgreSQL database.
Before you get started, make sure you have the Chalk CLI installed.
If you want to skip ahead, you can find the full source code for this tutorial on GitHub.
We'll start by modeling out the User feature class. We'll start simple with three scalar features: User.id, User.name, and User.email.
First, we'll create a new file called feature_sets.py where we'll define a User class decorated with @features.
from chalk.features import features


@features
class User:
    id: int

    # The name the user provided to us at signup.
    # :owner: identity@chalk.ai
    # :tags: pii
    name: str

    # :tags: pii
    email: str
Note that at this point we haven’t defined how to compute these features. We are only thinking about the data that we would like to have.
There are a few things to note here. First, all of our feature classes need to have a unique id field. By default, this is the field named id. However, if you want to use a different feature name as the primary key, you can specify it by annotating the primary key feature with the Primary type.
from chalk.features import features, Primary


@features
class User:
    user_id: Primary[int]
    name: str
    email: str
In our features above, we've added some comments and annotations. These are optional, but can be useful for documentation and for setting alerting policies. For example, you may wish to send PagerDuty alerts to different teams based on the owner of the related feature.
All of the comments and tags from the code also show up in the Chalk dashboard, and are indexed for search.
For example, we've added a pii tag to the name and email fields. This means that these fields will be treated as personally identifiable information and will be subject to additional restrictions.
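Equivalently (a sketch using the feature() helper rather than comment annotations; the tutorial itself keeps the comment form), owners and tags can be set explicitly in code:

from chalk.features import features, feature


@features
class User:
    id: int
    # The same owner and tag metadata as the comment-based form above.
    name: str = feature(owner="identity@chalk.ai", tags=["pii"])
    email: str = feature(tags=["pii"])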
Next up, we'll define a feature class related to our User. We'll call this class Account, and it will represent a user's bank account.
from chalk.features import features


@features
class Account:
    id: int

    # The name of the owner of the account.
    title: str

    # The id of the user that owns this account.
    user_id: int

    # The balance of the account, in dollars.
    balance: float
This should look much like what we did for the User class. However, we may want to link these two classes together. We can do this by adding a user field to the Account class.
@features
class Account:
    id: int
    user_id: User.id
    balance: float

    # The user that owns this account.
    user: User
This denotes that each account has one user, and that Account.user_id and User.id are equal and of type int, as described by User.id.
Once we’ve defined the relationship on one side of the join, we can define the inverse relationship on the other side without needing to specify the foreign key again.
@features
class User:
    id: int
    name: str
    email: str

    # The account that this user owns.
    account: "Account"
Note that the account annotation in the User feature class is in quotes. This is because the Account class is defined later in the file; Chalk will recognize this as a valid feature reference and process it correctly.
The final feature entity that we’ll define in this tutorial is for transactions.
Each account has many transactions and each transaction is linked to a single account.
We'll define the Transaction class and link it to our Account class as follows:
from enum import Enum

from chalk.features import features, DataFrame, FeatureTime


class TransactionStatus(str, Enum):
    PENDING = "pending"
    CLEARED = "cleared"
    FAILED = "failed"


@features
class Transaction:
    id: int

    # The id of the account that this transaction belongs to, set to a join.
    # We refer to features and feature classes defined further down in the file
    # using quotation marks, so Chalk will recognize that it is a valid
    # feature reference to be processed later.
    account_id: "Account.id"

    # The amount of the transaction, in dollars.
    amount: float

    # The status of the transaction, defined as an enum above.
    status: TransactionStatus

    # When the transaction occurred.
    created_at: FeatureTime

    # Because we define the join condition between
    # `Transaction` and `Account` below, we don't
    # need to repeat it here.
    account: "Account"


@features
class User:
    id: int
    name: str
    email: str

    # The account that this user owns.
    account: "Account"


@features
class Account:
    id: int

    # The name of the owner of the account.
    title: str
    user_id: User.id
    balance: float
    user: User
    transactions: DataFrame[Transaction]
This is the first time we're seeing the DataFrame type. A Chalk DataFrame models tabular data in much the same way that pandas does. However, there are some key differences that allow the Chalk DataFrame to increase type safety and performance. Like pandas, Chalk's DataFrame is a two-dimensional data structure with rows and columns. You can perform operations like filtering, grouping, and aggregating on a DataFrame.
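For instance (a sketch that jumps ahead to the resolver syntax covered below; the count_large_transactions feature and the threshold are hypothetical and not part of this tutorial), you can project, filter, and aggregate a has-many DataFrame directly in a resolver:

from chalk import online
from src.feature_sets import Account, Transaction


@online
def count_large_transactions(
    # Project to the amount column and filter rows in the annotation itself.
    txns: Account.transactions[Transaction.amount, Transaction.amount > 100],
) -> Account.count_large_transactions:  # hypothetical feature, for illustration
    # Chalk DataFrames support aggregations such as count(), sum(), and mean().
    return txns.count()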
However, there are two main differences:
1. A Chalk DataFrame is lazy and can be backed by multiple data sources, whereas a pandas.DataFrame executes eagerly in memory.
2. A Chalk DataFrame[...] can be used to represent a type of data with pre-defined filters.
You might also notice that we used an Enum feature. Chalk supports many feature types, including Enums, lists/sets, and dataclasses.
We also added a created_at field to the Transaction class of type FeatureTime. FeatureTime is a special annotation that can only be assigned to a single feature in a feature class. This feature specifies the logical time of a feature instance: it is used for point-in-time correctness and incremental updating. We'll cover both later in this tutorial.
For most companies, the primary sources of data are SQL databases (note, by SQL database we don't mean only relational databases: we mean any database that can be queried with SQL, which in practice means pretty much any of them). Chalk can automatically ingest data from SQL databases and map the results into feature classes.
In our example application, we have two databases: PostgreSQL and Snowflake. Our PostgreSQL database is the primary database used in our codebase, and our Snowflake database is used for analytics, with tables populated from DBT views and batch jobs.
To configure your SQL sources in Chalk, you'll: 1) add your data sources to the Chalk dashboard, and 2) define the data sources in your Chalk code.
To add your data sources to the Chalk dashboard, go to the Data Sources tab of the dashboard and click on the Add a data source button. You can then select the type of data source you want to add. Select PostgreSQL and fill in the required information. Repeat the same process with a Snowflake data source.
Keep track of the names that you've given to each of your data sources: these will be used in the next step. We recommend naming your PostgreSQL data source postgres and your Snowflake data source snowflake.
Note, the data source connection information you add in the dashboard will be stored securely using the secrets manager of your cloud provider.
To define your data sources in code, you'll need to create a datasources.py file that contains a SnowflakeSource and a PostgreSQLSource:
from chalk.sql import SnowflakeSource, PostgreSQLSource

# If you named your postgres or snowflake data sources something other than
# `postgres` and `snowflake`, you'll have to update the code accordingly.
snowflake = SnowflakeSource(name="snowflake")
postgres = PostgreSQLSource(name="postgres")
Now that we’ve defined our data sources, we can use them to create root SQL resolvers.
Chalk's preferred way to ingest data from SQL databases is to use SQL file resolvers. With SQL file resolvers, you can write your root resolvers in SQL and use whatever database tooling you're familiar with to test, lint, and debug your code.
To create a SQL file resolver, add a file to your project directory with the extension .chalk.sql. You can now write a SQL query in this file. You will also need to add metadata to the top of the file to tell Chalk how to ingest the data.
For example, say that we want to resolve the name and email features of our User feature class from a PostgreSQL table. To do this, we can write the following SQL file resolver:
-- get users from postgres.
-- resolves: User
-- source: postgres
select
id,
full_name as name,
email
from users
The resolves key (-- resolves: User) tells Chalk which feature class the columns in the select statement should be mapped into. Then, the target names of the query are compared against the names of the features on the specified feature class. If the names match after stripping underscores and lower-casing, the select target is mapped to the feature.
In the example above, we aliased the full_name column to name, so it will be mapped to the name attribute on the User feature class.
Chalk validates your SQL file resolvers when you run chalk apply. If you want to validate your code without deploying, you can run the chalk lint command.
The source key (-- source: postgres) tells Chalk which integration to use to connect to the database.
Other comments in the SQL file resolver are indexed by Chalk and can be searched in the Chalk dashboard.
In the example resolver above, the comment get users from postgres becomes the description of the resolver in the dashboard.
Let's also add our get_accounts and get_transactions SQL file resolvers:
-- get accounts from postgres.
-- resolves: Account
-- source: postgres
select
id,
user_id,
balance
from accounts
-- get transactions from postgres.
-- resolves: Transaction
-- source: postgres
select
id,
account_id,
amount,
status,
created_at
from transactions
Now that we’ve written a few resolvers, we can deploy our feature pipeline and query our data in realtime.
To deploy, you'll want to run chalk apply. You will be shown the changes that you've made since your last deployment and prompted for confirmation:
chalk apply
✓ Found resolvers
✓ Successfully validated features and resolvers!
✓ Checked against live resolvers
Added Features
Name
───────────────────────────
+ transaction.id
+ transaction.account_id
+ transaction.amount
+ transaction.status
+ transaction.created_at
+ transaction.account
+ user.id
+ user.name
+ user.email
+ user.account
+ account.id
+ account.user_id
+ account.balance
+ account.user
+ account.transactions
Would you like to deploy? [y/n]
If you accept, Chalk will build and deploy your new code. Once that's done, you can start querying your data in realtime!
Now that we’ve deployed our feature pipeline, we can query our data in realtime. One of the easiest ways to do this is from the Chalk CLI.
$ chalk query --in user.id=1 --out user.name --out user.email
user.name "John Doe"
user.email "john@doe.com"
This query will fetch the name and email attributes from the User feature class for the user with id=1, hitting the PostgreSQL database directly.
Note that in the SQL file resolver we wrote, we didn't include a where clause. However, Chalk automatically pushes down filters to the database when querying features.
The SQL query that executes against our PostgreSQL database is:
select
id,
full_name as name,
email
from users
where id = 1;
Chalk can also push down non-primary-key filters to SQL databases. For example, to fetch all of a user's transactions, Chalk will modify the get_transactions resolver to include a where clause for the user's account_id:
select
id,
account_id,
amount,
status,
created_at
from transactions
where account_id = 38;
You can see this in action by querying for a user's transactions:
$ chalk query --in user.id=1 --out user.account.transactions
Results
No scalar features
user.account.transactions
id account_id amount status created_at
─────────────────────────────────────────────────────────────────────────────────────
197524 38 12.00 "cleared" "2023-12-02T21:05:54.057868+00:00"
198604 38 27.51 "cleared" "2023-04-29T22:27:12.058023+00:00"
210326 38 93.27 "cleared" "2023-02-17T21:49:27.058144+00:00"
228363 38 1.05 "cleared" "2023-04-20T21:46:33.058240+00:00"
230225 38 23.91 "cleared" "2023-06-09T03:07:44.058314+00:00"
235551 38 12.20 "failed" "2022-12-26T15:45:07.058416+00:00"
We’ve noticed that some fraudsters try to link stolen accounts to our platform and attempt to transfer money through our system. To detect this behavior, we want to compute a similarity score between the user’s name and the account’s title.
We'll start by adding this new feature, account_name_match, to our User feature class.
@features
class User:
    id: int
    name: str
    email: str
    account: "Account"

    # The similarity between the user's name and the account's title.
    account_name_match: float
Next, we’ll define a resolver that computes this feature. We’ll use Jaccard similarity to compute the similarity score.
from src.feature_sets import User
from chalk import online


@online
def account_name_match(
    title: User.account.title,
    name: User.name,
) -> User.account_name_match:
    """Docstrings show up in the Chalk dashboard"""
    intersection = set(title) & set(name)
    union = set(title) | set(name)
    return len(intersection) / len(union)
The @online decorator tells Chalk that this resolver should be called in realtime when the User.account_name_match feature is requested. Our feature dependencies are declared in the function signature as User.account.title and User.name.
Chalk will automatically retrieve User.id and User.name with our get_users.chalk.sql resolver. Then, using this id, Chalk will retrieve Account.title (joining on Account.user_id) with the get_accounts.chalk.sql resolver. Let's deploy our code.
$ chalk apply --branch tutorial
✓ Found resolvers
✓ Deployed branch
Note, this time we are using a branch (by specifying the --branch <name> flag). When testing new features, we recommend deploying your feature pipeline to a branch, which allows you to test your changes without affecting your production feature pipelines. An additional benefit of branches is that they are incredibly lightweight: they should deploy pretty much instantly.
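For example (assuming the branch deploy above succeeded and that a user with id 1 exists in your data), you can exercise the new feature against the branch before promoting it:

$ chalk query --in user.id=1 \
    --out user.account_name_match \
    --branch tutorial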
Resolvers are callable functions, so we can also test them like any other Python function. Let’s test our new resolver by writing a unit test:
This is one of the only times when you'll explicitly call your resolver functions. We covered this idea already in the intro, but since it's one of the central concepts behind Chalk, it is worth reiterating: most of the time, Chalk determines which resolvers it needs to run to compute the features you've requested in a query, much like a SQL database decides how to compute the output you've requested.
from src.resolvers import account_name_match


def test_names_match():
    """Resolvers can be unit tested exactly as you would expect.

    Here, the `account_name_match` resolver should return 1.0
    because the `title` and `name` are identical.
    """
    assert 1 == account_name_match(
        title="John Coltrane",
        name="John Coltrane",
    )


def test_names_completely_different():
    """The `account_name_match` resolver should return 0
    because the `title` and `name` don't share any characters.
    """
    assert 0 == account_name_match(
        title="John Coltrane",
        name="Zyx",
    )
If you want to learn more, we also provide additional docs on testing resolvers.
Let's say we have a rather complex feature that we can't serve in realtime because it increases the latency of a query too much for a particular production use case. For instance, let's say we want to add a new feature to our users that indicates the average time between their transactions. Maybe we believe that we can use this to detect whether a new transaction is fraudulent.
To add this feature, we'd add it to the User feature class:
@features
class User:
    id: int
    name: str
    email: str
    account: "Account"

    # The similarity between the user's name and the account's title.
    account_name_match: float

    # The average time between each of a user's transactions, from the last
    # 30 days, in ms.
    average_time_between_transactions_30d: float
We would then write a resolver to calculate this feature:
from chalk import online
from chalk.features import after
from src.feature_sets import Transaction, User


@online
def get_average_time_between_transactions(
    transactions: User.account.transactions[
        Transaction.created_at,
        after(days_ago=30),
    ],
) -> User.average_time_between_transactions_30d:
    """Computes the average time between transactions for a user"""
    # Conversion to polars is cheap since both Chalk and polars DataFrames
    # use Arrow as their memory representation.
    df = transactions.to_polars()
    # Get the date column, convert to millisecond timestamps, and sort.
    date_col = df.collect().get_column(
        str(Transaction.created_at)
    ).dt.timestamp("ms").sort()
    return (date_col - date_col.shift(1)).mean()
For good measure, let's also add this to our tests and make sure that it is running as expected:
import datetime as dt

from chalk.features import DataFrame
from src.feature_sets import Transaction
from src.resolvers import get_average_time_between_transactions


def test_get_average_time_between_transactions():
    """To unit test DataFrame-taking resolvers, you can construct and
    pass in a Chalk DataFrame:
    """
    now = dt.datetime.now(tz=dt.timezone.utc)
    transactions = DataFrame({
        Transaction.created_at: [
            # will be filtered out since it did not occur in last 30d
            dt.datetime(1990, 1, 1, tzinfo=dt.timezone.utc),
            now,
            now - dt.timedelta(seconds=1),  # diff = 1000
            now - dt.timedelta(seconds=2),  # diff = 1000
            now - dt.timedelta(seconds=3),  # diff = 1000
            now - dt.timedelta(seconds=4),  # diff = 1000
            now - dt.timedelta(seconds=6),  # diff = 2000
        ]                                   # -----------
    })                                      # 6000 / 5 = 1200
    assert 1200 == get_average_time_between_transactions(
        transactions
    )
Let's deploy this new feature to a branch and run a couple of test queries:
chalk apply --branch new-fraud-feature
chalk query \
--in user.id=1 \
--out user.average_time_between_transactions_30d \
--branch new-fraud-feature
After some testing, we realize a couple of things: 1) this feature is too slow to compute in real time (for our use case), and 2) this feature doesn't change very often.
In practice, even executing the above resolver in realtime should be really fast. When building new features, do some testing. You may not even need to reverse-ETL features into the online store to achieve your target latency. Reverse-ETL-ing features into the online store adds state and complexity to your feature pipeline. We recommend only doing this if you are unable to achieve your latency targets.
We decide to reverse-ETL this computed feature from our Snowflake data store into our Chalk online store. At a high level, this means that periodically (maybe once a day), we'll:
1. query Snowflake for users and their recent transactions,
2. compute average_time_between_transactions_30d for these users, and
3. load the computed average_time_between_transactions_30d values into the Chalk online store.
Of note, we don't want to use our PostgreSQL database for this: we want to use Snowflake, which, unlike Postgres, is optimized for bulk data loads.
As a result, we'll need to define some offline resolvers. So far, we've only written online resolvers.
You can think of offline resolvers as overrides for online resolvers that are run in bulk data request scenarios.
At a high level, offline resolvers let you pull your feature data from data stores that are optimized for bulk data loading, which is exactly what we want for our reverse-ETL process.
To set up a reverse-ETL process, we'll need to do three things: 1) mark the feature with a max_staleness so its values can be served from the online store, 2) define offline resolvers that read from Snowflake, and 3) create a scheduled query that periodically writes the computed values to the online store.
We can have Chalk reverse-ETL our offline data into our online store by setting the max_staleness on a feature and using scheduled queries.
from chalk.features import features, feature


@features
class User:
    id: int
    name: str
    email: str
    account: "Account"

    # The similarity between the user's name and the account's title.
    account_name_match: float

    # The average time between each of a user's transactions, from the last
    # 30 days, in ms.
    average_time_between_transactions_30d: float = feature(max_staleness="infinity")
The max_staleness keyword argument tells Chalk how stale a feature value can get before it should be refreshed. In this case, we're telling Chalk that we'll tolerate arbitrarily old feature values.
However, we could also specify a max_staleness of 1h or 1d to tell Chalk not to serve feature values that are older than one hour or one day.
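For example, a one-hour bound would look like this (a sketch; the tutorial keeps "infinity" so cached values never expire):

from chalk.features import features, feature


@features
class User:
    id: int
    # Values older than one hour will be recomputed rather than served from the cache.
    average_time_between_transactions_30d: float = feature(max_staleness="1h")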
To take advantage of our max_staleness, we need to get the computed feature values into the online store.
Features are only written to the online store if they (or their feature class) have been given a max_staleness value. Features with a max_staleness are written to the online store in two ways: 1) when an online query computes the feature, the resulting value is cached in the online store, and 2) when an offline run (such as a scheduled query with online output enabled) loads computed values into the online store in bulk, as we'll set up below.
The first time average_time_between_transactions_30d is queried for any given user, it will be computed by the get_average_time_between_transactions resolver. All subsequent times that the feature is requested, it will be read from the online store (until the feature is evicted from the cache).
By setting up offline SQL resolvers, we are telling Chalk that there are times when we don't want to pull data from our Postgres data source and instead want to get our data from a bulk data store. There are a couple of reasons why we might want to do this: 1) we don't want to overwhelm our Postgres data source with requests that don't require real-time data, and 2) we want to query for large chunks of data (which is the purpose of a database like Snowflake).
In the case of reverse-ETLing data, we have decided that we will tolerate some degree of staleness in our average_time_between_transactions_30d feature in exchange for being able to serve the feature faster. As a result, it makes sense to read the new data from Snowflake.
To connect our feature classes to Snowflake, we'll write three more SQL file resolvers, this time labeling them with type: offline and source: snowflake.
-- get users from snowflake.
-- resolves: User
-- source: snowflake
-- type: offline
select
id,
full_name as name,
email
from users
-- get accounts from snowflake.
-- resolves: Account
-- source: snowflake
-- type: offline
select
id,
user_id,
balance
from accounts
-- get transactions from snowflake.
-- resolves: Transaction
-- source: snowflake
-- type: offline
select
id,
account_id,
amount,
status,
created_at
from transactions
With our offline SQL resolvers defined, we can now write a ScheduledQuery, which will periodically run and write updated average_time_between_transactions_30d features to the online store.
from chalk import ScheduledQuery
from src.feature_sets import User

ScheduledQuery(
    name="user-transaction-reverse-etl-features",
    schedule="0 * * * *",
    output=[User.average_time_between_transactions_30d],
    online=True,
    offline=True,
    incremental_resolvers="get_transactions_offline",
)
Our schedule, "0 * * * *", means that we will run this query and update the online store every hour. This means our features will be at most one hour stale (disregarding any lag in the Snowflake data source). Note that this allows for precise and granular trade-offs between staleness, latency, and compute.
In addition, by specifying an incremental resolver, we are telling Chalk to remember the highest ingestion timestamp from the previous run and only pull data that is newer than that high-water mark. Again, Chalk is processing the minimum amount of information it requires to calculate average_time_between_transactions_30d.
Any Python function can be used as a resolver. This means that we can call APIs to compute features. Let’s add a feature that computes the user’s FICO score from our credit scoring vendor, Experian.
As we did earlier, we’ll first add the features that we want to compute:
from chalk.features import features, feature


@features
class User:
    id: int
    name: str
    email: str
    account_name_match: float

    # The user's FICO score, as provided by a third-party vendor.
    fico_score: int = feature(min=300, max=850, strict=True)

    # Tags from our credit scoring vendor.
    credit_score_tags: list[str]
We are adding strict validation to our fico_score feature to ensure that we only store and utilize valid FICO scores.
Now, we can write a resolver to fetch the user’s FICO score from Experian.
from src.feature_sets import User
from src.mocks import experian
from chalk.features import online, Features


@online
def get_fraud_score(
    name: User.name,
    email: User.email,
) -> Features[User.fico_score, User.credit_score_tags]:
    response = experian.get_credit_score(name, email)
    # We don't need to provide all the features for
    # the `User` class, only the ones that we want to update.
    return User(
        fico_score=response['fico_score'],
        credit_score_tags=response['tags'],
    )
Here, we are returning two features of the user: User.fico_score and User.credit_score_tags. We use the Features type to indicate which features we expect to return.
Also note that we are initializing the User class with only the features that we want to update. This partial initialization is the primary difference between Python's @dataclass and Chalk's @features.
Finally, we’ll want to deploy our new resolvers. As we did earlier, we can check our work by using a branch deployment:
$ chalk apply --branch tutorial
✓ Found resolvers
✓ Deployed branch
We can then query our new features:
$ chalk query --branch tutorial \
--in user.id=1 \
--out user.fico_score
Now that we’ve written some features and resolvers and deployed them to Chalk, we’re ready to integrate Chalk into our production decisioning systems.
As a sanity check, it can be helpful to use the Chalk CLI to query a well-known input and ensure that we get the expected output.
We can use the chalk query command, passing in the id of a user and the names of the features we want to resolve:
$ chalk query --in user.id=1 \
--out user.name \
--out user.email \
--out user.account.balance
Results
user.name "John Doe"
user.email "john@doe.com"
user.account.balance 2032.91
Once we’re satisfied that our features and resolvers are working as expected, we can use a client library to query Chalk from our application.
In this first example, we'll use the ChalkClient in the chalkpy package to query Chalk from our application:
from src.feature_sets import User
from chalk.client import ChalkClient

# Create a new Chalk client. By default, this will
# pick up the login credentials generated after running
# `chalk login`.
client = ChalkClient()

client.query(
    input=User(id=1234),
    output=[
        User.id,
        User.name,
        User.fico_score,
        User.account.balance,
    ],
)
We use the same feature definitions for querying our data as we used for defining our features and resolvers.
Chalk has API client libraries in several languages, including Python, Go, Java, TypeScript, and Elixir.
All API clients can operate on the string names of features. However, in a production system, you may have many hundreds or thousands of features, and want to avoid hard-coding the names of each feature in your code.
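For instance (a sketch using the Python client from above), the same query can be written with plain string feature names:

from chalk.client import ChalkClient

client = ChalkClient()

# String feature names work anywhere feature references do.
client.query(
    input={"user.id": 1234},
    output=["user.name", "user.fico_score", "user.account.balance"],
)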
To help with this, Chalk can codegen a library of strongly-typed feature names for you.
For example, say the service that calls into Chalk is written in Go. We can generate a Go library of feature names with the following command:
$ chalk codegen go --out ./clients/go/client.go --package=client
✓ Found resolvers
✓ Wrote features to file './clients/go/client.go'
✓ Please do not change the generated code.
This generates a file clients/go/client.go that looks like this:
package client

/**************************************
 Code generated by Chalk. DO NOT EDIT.
 > chalk codegen go --out ./clients/go/client.go --package client
**************************************/

import (
    "github.com/chalk-ai/chalk-go"
    "time"
)

var InitFeaturesErr error

type Account struct {
    Id        *int64
    Title     *string
    UserId    *int64
    Balance   *float64
    User      *User
    UpdatedAt *time.Time
}

type User struct {
    Id               *int64
    Name             *string
    Email            *string
    Account          *Account
    AccountNameMatch *float64
    FicoScore        *int64
    CreditScoreTags  *[]any
}

var Features struct {
    Account *Account
    User    *User
}

func init() {
    InitFeaturesErr = chalk.InitFeatures(&Features)
}
We can then use this library to query Chalk:
import (
    "fmt"

    "github.com/chalk-ai/chalk-go"
)

// Create a new Chalk client.
client := chalk.NewClient()

// Create an empty struct to hold the results.
user := User{}

// Query Chalk, and add the results to the struct.
_, err := client.OnlineQuery(
    chalk.OnlineQueryParams{}.
        WithInput(Features.User.Id, 1234).
        WithOutputs(
            Features.User.Id,
            Features.User.Name,
            Features.User.FicoScore,
            Features.User.Account.Balance,
        ),
    &user,
)
if err != nil {
    panic(err)
}

// Now, you can access the properties of the
// user for which there was a matching `output`.
fmt.Println(user.Account.Balance)
If your calling service is written in Python, but you don’t want to take a dependency on the repository that contains your Chalk features, you can generate your Python features into a separate repository:
$ chalk codegen python --out ./clients/python/client.py
You can see the generated code in clients/python/client.py.
If you are generating Python into a subdirectory of your Chalk project, be sure to add an entry to your .chalkignore containing the directory of your generated code (in the above example, clients/). Otherwise, Chalk will find duplicate definitions of your features.
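For example, a minimal .chalkignore entry for the layout above (assuming gitignore-style patterns) might be:

# .chalkignore
# Keep generated client code out of Chalk's feature discovery.
clients/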