
Introduction

Chalk is a feature store that enables data engineers and data scientists on production machine learning teams to collaborate efficiently and effectively.

In this tutorial, we will walk you step by step through setting up a Chalk project, from installation through deployment and orchestration.


Creating your Chalk Project

  1. If you don’t already have a GitHub repository where you will store your Chalk code, create one. Then pick a local directory in which to work on Chalk code, clone the GitHub repository there, and cd into the directory.
  2. If you haven’t already, install Python. You can do this through Homebrew as follows:
Terminal
# Install Homebrew
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Use Homebrew to install python3
$ brew install python
  3. Install the Chalk command line tool. The Chalk CLI allows you to create, update, and manage your feature pipelines directly from your terminal.
$ curl -s -L https://api.chalk.ai/install.sh | sh
  4. Create a Python virtual environment within your repository root directory and activate it.
$ python3.10 -m venv .venv
$ source .venv/bin/activate

You can run source .venv/bin/activate to activate the virtual environment, and deactivate to deactivate it.
  5. Log in by typing chalk login. If you’re using a dedicated environment, make sure you use the --api-host flag. Type y when prompted and complete the login in the browser.
  6. Run chalk init to initialize your project files. This will create a directory structure with an empty src directory and three files: chalk.yaml, README.md, and requirements.txt.

root_directory/
├── src/
├── chalk.yaml
├── README.md
└── requirements.txt

The first file, chalk.yaml, stores configuration information about your project. You can edit this so it contains the following:

project: {YOUR_PROJECT_NAME}
environments:
  default:
    requirements: requirements.txt
    runtime: python310

The second file, README.md, contains some basic commands you can use with the Chalk CLI. The third file, requirements.txt, should look something like this:

requests
chalkpy[runtime]

This file lists your project dependencies. You can also add a .chalkignore file, which works just like a .gitignore file and excludes the specified files from deploys when you run chalk apply (more on this later).
  7. Within your virtual environment, install the project requirements.

Terminal
$ pip3 install -r requirements.txt

Now, you should have a virtual environment with all project dependencies installed and a basic project structure upon which we will build in the next step.
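The .chalkignore file mentioned above follows .gitignore syntax. As an illustrative sketch (the specific entries are assumptions about a typical project, not requirements), it might look like this:

# Exclude local tooling and test files from deploys
.venv/
__pycache__/
tests/
*.ipynb

Anything matched here is left out of the code bundle that chalk apply uploads, which keeps deploys smaller and faster.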

Configure Datasources

Having set up a basic project directory, next we’ll want to configure the data sources from which we will load data to compute our features. We can do so in the Chalk dashboard.

  1. In the same directory from before, log in or sign up with Chalk directly from the command line. The chalk login command will open your browser and create an API token for your local development, as well as redirect you to your dashboard. If you are not redirected, you can also find the dashboard at https://chalk.ai/projects or run the chalk dashboard command. You will automatically see all of the environments in which the email you used to log in has been provisioned as a user.
  2. Within the dashboard, navigate to Data Sources in the side bar and add all data sources that you will be working with here. After you have saved a data source, you can use the Test Data Source button in the upper right hand corner of the Data Source configuration view to verify that your connection is valid.
  3. Within the working directory, we’ll add a datasources.py file under our src folder to reference the data sources that we’ve added in the dashboard.
    root_directory/
    ├── src/
    │  ├── __init__.py
    │  └── datasources.py
    ├── chalk.yaml
    ├── README.md
    └── requirements.txt
    

Say we added a PostgreSQL data source named PG in the dashboard; then our datasources.py might look something like this:

from chalk.sql import PostgreSQLSource

pg = PostgreSQLSource(name='PG')

For more details on setting up data sources, see here.


Define Feature Sets and Resolvers

Next, we’ll define our feature sets and resolvers. Each feature set is a Python class of features, and each resolver tells Chalk how to compute the values for different features. Each feature that we write should correspond to a resolver output.

We recommend starting with a minimal feature set and building up iteratively to easily test your code along the way. After writing some feature sets, resolvers, and tests, we would expect to see a directory structure like this:

root_directory/
├── src/
│  ├── resolvers/
│  │  ├── .../
│  │  ├── __init__.py
│  │  └── pipelines.py
│  ├── __init__.py
│  ├── datasources.py
│  └── feature_sets.py
├── tests/
│  └── ...
├── .chalkignore
├── chalk.yaml
├── README.md
└── requirements.txt

You can read more in our docs about the different kinds of features and the different kinds of resolvers that you can write. If you would like guidance on how to structure your feature sets and resolvers, please reach out in your support channel!
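To make the concepts above concrete, here is a minimal sketch of a feature set and an online resolver. The User class, its fields, and the email_domain feature are hypothetical examples, not part of your project:

from chalk import online
from chalk.features import features


@features
class User:
    id: int
    email: str
    # A derived feature, computed by the resolver below
    email_domain: str


@online
def get_email_domain(email: User.email) -> User.email_domain:
    # A resolver maps its input features to its output features
    return email.split("@")[-1]

Note how the resolver's signature declares both its dependencies (User.email) and its output (User.email_domain); Chalk uses these annotations to plan which resolvers to run for a given query.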


Deploy and Query

Now, you can deploy the features and resolvers that you wrote! You can deploy to production by using the chalk apply command. During development, we recommend that you use chalk apply --branch {BRANCH_NAME} to deploy to the branch server, which allows multiple people to work concurrently in one environment, and also enables more performant deploys.

Once you have deployed your code, you can query your features directly from the command line using the chalk query command, or by calling one of our Chalk Clients in code. Chalk has a Python client, Go client, Java client, and a TypeScript client.
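With the Python client, a query might look like the following sketch. The feature names here (user.id, user.email) are placeholders for whatever features you defined; the client reads the credentials created by chalk login by default:

from chalk.client import ChalkClient

client = ChalkClient()
result = client.query(
    # Input features identify the entity to compute features for
    input={"user.id": 1},
    # Output features are what you want back
    output=["user.email"],
)

This is the same request shape the chalk query CLI command sends, so you can prototype on the command line and then move the query into application code.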

This is the primary workflow for iterating on features and resolvers! Write, deploy, and query to verify whether the feature values that you receive are the values you expect. Once you have finalized your feature set and resolver definitions, the final step is frequently orchestration.


Orchestration

Having verified that your feature and resolver definitions are correct, the next step is to determine how you want to use the corresponding feature values within your larger machine learning platform. Some users trigger resolvers and run queries from within other orchestrated pipelines, such as Airflow. Some users define cron schedules for their resolvers, and set staleness values on different features to ensure that the data they query falls within their requirements for freshness. The data world is your oyster!

But, as always, if you would like guidance on how to configure that oyster, please reach out in your support channel!
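As a sketch of the scheduling and freshness options mentioned above (the Account class, its fields, and the specific schedule are hypothetical), a cron schedule on a resolver and a max_staleness on a feature might look like:

from chalk import online
from chalk.features import feature, features


@features
class Account:
    id: int
    # Served values may be up to one day old before recomputation
    balance: float = feature(max_staleness="1d")


# Run on a schedule, in addition to resolving on demand
@online(cron="1h")
def get_balance(id: Account.id) -> Account.balance:
    ...

Together, the cron schedule keeps values warm and the staleness bound tells Chalk when a cached value is still acceptable to serve.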

Further Resources

For a detailed tutorial on how to build a fraud model using Chalk, see here.