Getting Started
Customizing your Chalk solution.
Chalk is a feature store that enables data engineers and data scientists on production machine learning teams to collaborate efficiently and effectively.
In this tutorial, we will guide you step by step through customizing your Chalk solution to help you achieve all of your data goals.
To get started, create a new directory for your project and `cd` inside of the directory. You will also need Python installed; on macOS, you can install it with Homebrew:

```bash
# Install Homebrew
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Use Homebrew to install python3
$ brew install python
```

Then, install the Chalk CLI:

```bash
$ curl -s -L https://api.chalk.ai/install.sh | sh
```

Next, create a virtual environment for the project:

```bash
$ python3.10 -m venv .venv
$ source .venv/bin/activate
```
You can run `source .venv/bin/activate` to activate the virtual environment, and `deactivate` to deactivate it.
Log in by typing `chalk login`. If you're using a dedicated environment, make sure you use the `--api-host` flag. Type `y` when prompted and log in through the browser.
Run `chalk init` to initialize your project files. This creates a directory structure with an empty `src` directory and three files: `chalk.yaml`, `README.md`, and `requirements.txt`.
```
root_directory/
├── src/
├── chalk.yaml
├── README.md
└── requirements.txt
```
The first file, `chalk.yaml`, stores configuration information about your project. You can edit it so that it contains the following:
```yaml
project: {YOUR_PROJECT_NAME}
environments:
  default:
    requirements: requirements.txt
    runtime: python310
```
The second file, `README.md`, contains some basic commands you can use with the Chalk CLI. The third file, `requirements.txt`, lists your project dependencies and should look something like this:

```
requests
chalkpy[runtime]
```
You can add a `.chalkignore` file, which acts just like a `.gitignore` file and excludes the specified files from deploys when you run `chalk apply` (more on this later).
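For example, a `.chalkignore` might look like the following sketch; the entries here are only illustrative, and you should exclude whatever local files you don't want shipped with a deploy:

```
# Hypothetical example entries
.venv/
tests/
notebooks/
```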
Finally, install your project dependencies:

```bash
$ pip3 install -r requirements.txt
```
Now, you should have a virtual environment with all project dependencies installed and a basic project structure upon which we will build in the next step.
Having set up a basic project directory, next we’ll want to configure the data sources from which we will load data to compute our features. We can do so in the Chalk dashboard.
The `chalk login` command will open your browser and create an API token for your local development, as well as redirect you to your dashboard. If you are not redirected, you can also find the dashboard at https://chalk.ai/projects or run the `chalk dashboard` command. You will automatically see all the environments in which the email you used to log in has been provisioned as a user.

Open Data Sources in the sidebar and add all the data sources that you will be working with. After you have saved a data source, you can use the Test Data Source button in the upper right-hand corner of the Data Source configuration view to verify that your connection is valid.

Next, we can add a `datasources.py` file under our `src` folder to reference the data sources that we've added in the dashboard:

```
root_directory/
├── src/
│   ├── __init__.py
│   └── datasources.py
├── chalk.yaml
├── README.md
└── requirements.txt
```
Say we added a PostgreSQL data source; our `datasources.py` might then look something like this:

```python
from chalk.sql import PostgreSQLSource

pg = PostgreSQLSource(name='PG')
```
For more details on setting up data sources, see here.
Next, we’ll define our feature classes and resolvers. Each feature class is a Python class of features, and each resolver tells Chalk how to compute the values for different features. Each feature that we write should correspond to a resolver output.
We recommend starting with a minimal feature class and building up iteratively to easily test your code along the way. After writing some feature classes, resolvers, and tests, we would expect to see a directory structure like this:
```
root_directory/
├── src/
│   ├── resolvers/
│   │   ├── .../
│   │   ├── __init__.py
│   │   └── pipelines.py
│   ├── __init__.py
│   ├── datasources.py
│   └── feature_sets.py
├── tests/
│   └── ...
├── .chalkignore
├── chalk.yaml
├── README.md
└── requirements.txt
```
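To make this concrete, here is a minimal sketch of a feature class and a resolver. The `User` class, its fields, and the `email_domain` feature are hypothetical; the features you define will depend on your own data:

```python
from chalk import online
from chalk.features import features

@features
class User:
    id: int
    email: str
    # Derived feature, computed by the resolver below
    email_domain: str

@online
def get_email_domain(email: User.email) -> User.email_domain:
    # Chalk runs this resolver to compute User.email_domain
    # whenever it is requested in a query.
    return email.split("@")[-1]
```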
You can read more in our docs about the different kinds of features and the different kinds of resolvers that you can write. If you would like guidance on how to structure your feature classes and resolvers, please reach out in your support channel!
Now, you can deploy the features and resolvers that you wrote! You can deploy to production using the `chalk apply` command. During development, we recommend that you use `chalk apply --branch {BRANCH_NAME}` to deploy to the branch server, which allows multiple people to work concurrently in one environment and also enables more performant deploys.
Once you have deployed your code, you can query your features directly from the command line using the `chalk query` command, or by calling one of our Chalk clients in code. Chalk has a Python client, a Go client, a Java client, and a TypeScript client.
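For instance, with the Python client, an online query might look like the following sketch. The feature names reuse the hypothetical `User` class from above, and the client is assumed to pick up the credentials created by `chalk login`:

```python
from chalk.client import ChalkClient

# Picks up credentials from `chalk login` or environment variables
client = ChalkClient()

result = client.query(
    input={"user.id": 1},
    output=["user.email_domain"],
)
print(result.get_feature_value("user.email_domain"))
```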
This is the primary workflow for iterating on features and resolvers! Write, deploy, and query to verify whether the feature values that you receive are the values you expect. Once you have finalized your feature class and resolver definitions, the final step is frequently orchestration.
Having verified that your feature and resolver definitions are correct, the next step is to determine how you want to use the corresponding feature values within your larger machine learning platform. Some users trigger resolvers and run queries from within other orchestrated pipelines, such as Airflow. Some users define cron schedules for their resolvers and set staleness values on different features to ensure that the data they query falls within their requirements for freshness. The data world is your oyster!
But, as always, if you would like guidance on how to configure that oyster, please reach out in your support channel!
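As a sketch of the cron-plus-staleness pattern mentioned above, a feature can declare a `max_staleness` and a resolver can carry a `cron` schedule. The names, durations, and placeholder computation here are hypothetical:

```python
from chalk import offline
from chalk.features import feature, features

@features
class User:
    id: int
    # Serve cached values up to one day old before requiring a recompute
    credit_score: float = feature(max_staleness="1d")

# Recompute credit scores on a daily schedule
@offline(cron="1d")
def get_credit_score(id: User.id) -> User.credit_score:
    # Placeholder computation for illustration
    return 700.0
```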
For a detailed tutorial on how to build a fraud model using Chalk, see here.