Development
Learn our best practices for building and maintaining your Chalk solution
With Chalk, the same solution can be implemented in a number of different ways. Below are some guidelines and recommended patterns for building and maintaining your Chalk solution!
Define integrations through the Chalk dashboard and check that they're connecting properly using the Test Data Source button.
Though this works, it can lead to ambiguity in SQL resolvers when you add multiple data sources of the same type. This occurs because Chalk lets you refer to a data source by its type only as long as a single data source of that type is linked to Chalk. In general, to future-proof your resolvers, refer to data sources by name.
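As a sketch, a SQL file resolver can pin the exact source by name in its comment configuration. The source name my_pg, the feature set user, and the users table below are hypothetical:

```sql
-- resolves: user
-- source: my_pg
select id, name, birthday from users
```

If a second Postgres source is added later, this resolver keeps working unchanged because it never relied on the ambiguous type-based reference.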
Features fully specify what you want your data to look like. Once you have an understanding of the inner relations, writing resolvers for your features becomes easier.
While Chalk allows for dataclass feature types, they should be avoided. They don’t always play nicely with serialization and can cause tough-to-debug errors. We recommend either unpacking the nested class into basic types or defining and joining an additional Chalk feature set if the nested component is truly a separate entity. Defining a separate feature set also makes the underlying data easier to monitor and test.
We strongly recommend annotating your features as you are developing them. Your feature annotations show up in the Chalk dashboard and are a good way of documenting your code. You can also add tags and owners to your features, which can be used for aggregation and filtering in the Chalk Dashboard.
```python
from datetime import datetime

from chalk.features import features


@features
class User:
    id: str

    # the user's full name
    # :owner: mary.shelley@aol.com
    # :tags: team:identity, priority:high
    name: str

    # the user's birthdate
    # :owner: mary.shelley@aol.com
    # :tags: team:identity, priority:high
    birthday: datetime
```
If you want to apply a tag or owner to all features in a feature set, this should be done in the feature decorator, like so:
```python
from chalk.features import features


@features(owner="ada.lovelace@aol.com", tags=["group:risk"])
class User:
    id: str

    # the user's full name.
    name: str
```
You can also apply restrictions or enforce feature annotation for your entire project. For instance, you can block deployment if features are not tagged or described.
Features should be defined in separate files from resolvers (except underscore features).
Although it can get a bit lengthy, we recommend starting by defining your features in a single file. This makes expressing joins between features easier and prevents circular dependencies.
Validations for your features can prevent incorrect data from being written to your offline store. They can provide an even stricter complement to monitoring, ensuring that nothing is going wrong with the feature you are calculating.
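For instance, a numeric feature can be given bounds at definition time. This sketch assumes Chalk's feature-level validation parameters (min, max, strict); the fico_score feature is hypothetical:

```python
from chalk.features import feature, features


@features
class User:
    id: str
    # With strict=True, values outside [300, 850] are rejected
    # rather than merely flagged.
    fico_score: int = feature(min=300, max=850, strict=True)
```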
Joins between feature sets can be specified in a number of different ways. We recommend using an implicit join syntax, which we cover in the join section of the docs.
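A minimal sketch of the implicit style, assuming feature sets like the hypothetical User and Transaction below: annotating the foreign-key feature with the other feature set's id type establishes the join without a separate join expression.

```python
from chalk.features import features


@features
class User:
    id: str
    name: str


@features
class Transaction:
    id: str
    # Annotating user_id as User.id sets up the join implicitly
    user_id: User.id
    user: User
    amount: float
```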
Underscore resolvers should be used for relatively simple calculations: your underscore resolver definitions should fit inline.
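For example, a derived feature that is just arithmetic over sibling features fits comfortably inline (the Transaction feature set here is hypothetical):

```python
from chalk import _
from chalk.features import features


@features
class Transaction:
    id: str
    subtotal: float
    tax: float
    # A simple inline underscore feature: total stays readable
    # as a one-line expression.
    total: float = _.subtotal + _.tax
```

Anything more involved than this, such as multi-step logic or external lookups, belongs in a named resolver.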
While Python resolvers can be used to read data from your data sources, SQL resolvers are preferred. SQL resolvers allow for direct execution against your data sources and additional optimizations, making them more efficient.
Select statements in SQL resolvers should be explicit: avoid the select * syntax.
Resolver names are used in the Chalk dashboard to identify resolvers. They should be clear and concise.
Resolver inputs and outputs must belong to the same feature set, but joins can allow resolvers to connect data between feature sets.
Don't worry about converting Chalk DataFrames to pandas or Polars in a Python resolver: the transformation is cheap. We use Apache Arrow (and so do pandas and Polars), so moving data from a Chalk DataFrame to either is close to free.
The Chalk CLI should be used to run simple online queries. For more complex use cases, you should use one of Chalk’s API clients.
Naming queries makes it easier to evaluate and track the performance of specific queries over time. This can be done using the query_name parameter in the query functions of your client of choice.
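With the Python client, a named query might look like the following sketch. The feature references and the query name user_name_lookup are hypothetical, and running this requires a configured Chalk deployment:

```python
from chalk.client import ChalkClient

client = ChalkClient()
result = client.query(
    input={"user.id": "u_123"},
    output=["user.name"],
    # query_name lets you filter for this query in the dashboard
    query_name="user_name_lookup",
)
```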
Use the branch server to test that your deployments and new features are behaving as expected.
We recommend organizing your Chalk project along these lines:

```
company_chalk/
├── src/
│   ├── resolvers/
│   │   ├── .../
│   │   ├── __init__.py
│   │   └── pipelines.py
│   ├── __init__.py
│   ├── datasources.py
│   └── feature_sets.py
├── tests/
│   └── ...
├── notebooks/
│   └── ...
├── .chalkignore
├── chalk.yaml
├── README.md
└── requirements.txt
```
Your .chalkignore file should include your scripts, notebooks, and tests: anything that you are not actively using in your deployment should be listed there so that non-deployment code does not clutter or interfere with your deployment.
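For the project layout above, a .chalkignore along these lines would keep tests and notebooks out of the deployment (assuming gitignore-style patterns):

```
# exclude anything not needed at runtime
tests/
notebooks/
*.ipynb
```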
Global setup for resolvers should be done through a function decorated with the @before_all decorator. This also allows for unique setups for different environments.
To access files packaged in your chalk deployments, use the TARGET_ROOT environment variable to fully specify the path to your files.
For instance, if you have the following directory which you are deploying to Chalk:
example/
├── chalk.yaml
├── features.py
├── model.joblib
└── resolvers.py
You would access the model.joblib file as follows:
```python
import os

model_file_path = f"{os.environ['TARGET_ROOT']}/model.joblib"
```
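If you also run the same code outside a Chalk deployment, where TARGET_ROOT may be unset, a fallback keeps the path logic working. The "." default below is an assumption for local runs, not Chalk behavior:

```python
import os

# TARGET_ROOT is set in Chalk deployments; fall back to the current
# directory for local runs (the "." default is our own assumption).
target_root = os.environ.get("TARGET_ROOT", ".")
model_file_path = os.path.join(target_root, "model.joblib")
```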
Monitoring helps you catch tricky bugs early and gives you guarantees about the data you are generating and serving. You should configure monitoring for your important features and resolvers early in your implementation.
Chalk makes it easy to set up unit tests for your resolvers using pytest or any other Python testing framework.
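Because resolvers are ordinary Python functions underneath, you can often test their logic by calling them directly. The get_full_name function below is a hypothetical stand-in for resolver logic (in a real project it would be a decorated Chalk resolver typed with your feature classes):

```python
# Hypothetical resolver logic, shown as a plain function for testing.
def get_full_name(first: str, last: str) -> str:
    return f"{first} {last}".strip()


def test_get_full_name():
    assert get_full_name("Mary", "Shelley") == "Mary Shelley"
    # leading whitespace is stripped when a component is empty
    assert get_full_name("", "Shelley") == "Shelley"


test_get_full_name()
```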
Chalk has a GitHub Actions integration—you can use it to create branch deployments or run queries as part of your code development cycle.
Scope your Chalk user permissions with access tokens. These can be programmatically generated through Chalk clients or in the dashboard. Give your users only the permissions they need.