Notebook Development
Use Chalk in your notebook of choice.
Chalk makes it really easy to write and test out new features from a notebook.
To get started, install Chalk and connect it to your notebook environment. Generally this involves two steps: installing the chalkpy[runtime] package, and authenticating (through the chalk CLI or by setting environment variables).
We provide specific setup instructions for Deepnote, Hex and Vertex, but the process should be similar for other notebook environments.
To install dependencies, run the !pip install "chalkpy[runtime]" bash command directly in a cell. Alternatively, you can add and install the chalkpy[runtime] dependency in your kernel’s virtual environment.
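After installing, it can help to confirm that the package is visible to the kernel the notebook is actually running, since a notebook kernel may use a different virtual environment than your shell. The following is a minimal sketch using only the standard library. It assumes the package is imported as chalk (as in the examples on this page), and the helper name is_importable is hypothetical.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the running kernel can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


# sys.executable shows which interpreter (and therefore which virtual
# environment) this kernel is using.
print("Kernel interpreter:", sys.executable)
print("chalk importable:", is_importable("chalk"))
```

If the second line prints False, the install most likely went into a different environment than the one backing the kernel.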
To authenticate, we recommend generating a client ID and client secret pair.
To do this, open the Chalk Dashboard. Go to Settings > Access Tokens and click on New Token, selecting all necessary permissions (generally, you’ll want: online query, offline query, and preview deployment permissions).
You can now use these access credentials to connect Chalk to any Jupyter notebook environment—we walk through this process for a few different notebook environments in our notebook setup docs.
from chalk.client import ChalkClient
import os
CHALK_CLIENT_ID = os.environ.get("CHALK_CLIENT_ID")
CHALK_CLIENT_SECRET = os.environ.get("CHALK_CLIENT_SECRET")
client = ChalkClient(
    client_id=CHALK_CLIENT_ID,
    client_secret=CHALK_CLIENT_SECRET,
    branch='notebook',
)
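Note that os.environ.get returns None when a variable is unset, so a missing credential may only surface later as an authentication error. One way to fail earlier is sketched below. The helper name load_chalk_credentials is hypothetical and not part of Chalk; it only reads the two environment variables used above.

```python
import os


def load_chalk_credentials(env=os.environ):
    """Read the client ID and secret, raising early if either is unset or empty."""
    names = ("CHALK_CLIENT_ID", "CHALK_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Keys match the keyword arguments passed to ChalkClient in the example above.
    return {"client_id": env[names[0]], "client_secret": env[names[1]]}
```

The returned dictionary can then be unpacked into the constructor shown above, for example ChalkClient(**load_chalk_credentials(), branch='notebook').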
To verify that everything is properly set up, you can run the following code in a cell:
client.whoami()
If everything is properly configured, you should see your user_id, team_id and environment_id come back:
WhoAmIResponse(user='<user_id>', environment_id=<environment_id>, team_id='<team_id>')
Now we can start building some new features!
To load your Chalk features into your notebook, you’ll want to run:
from chalk.client import ChalkClient
client = ChalkClient()
client.load_features()
This will print out your available features and load them into your notebook:
Director
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Feature            ┃ Type                  ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ id                 │ str                   │
│ name               │ str                   │
│ birth_year         │ int                   │
│ active             │ bool                  │
│ favorite_numbers   │ list[int]             │
│ death_year         │ int | None            │
│ known_for_titles   │ list[str]             │
│ known_for_genres   │ list[str]             │
│ movies             │ DataFrame[Movie]      │
│ num_movies         │ int                   │
└────────────────────┴───────────────────────┘
Movie
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Feature                ┃ Type              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ id                     │ str               │
│ director_id            │ str               │
│ release_number         │ int               │
│ type_                  │ str               │
│ title                  │ str               │
│ original_title         │ str               │
│ release_date           │ dt.datetime       │
│ runtime_m              │ float             │
│ primary_genre          │ str               │
│ secondary_genre        │ str               │
└────────────────────────┴───────────────────┘
We’ll use the data schema above for the rest of this tutorial.
Before we start writing features, you’ll want to import _ from Chalk. The _ (underscore) allows you to build out features on your feature classes.
from chalk.features import _
Additionally, if you want to create and work from a new branch (which is recommended), run the following command:
client.get_or_create_branch(branch_name="<your-branch-name>")
This command creates the branch if it does not exist yet and points your client at it. If the branch already exists, your client will just be updated to point to that branch (no new branch deployment will be created).
Let’s say we want to write a new “exclaimed name” feature, which is a director’s name with an exclamation mark tacked on. To do this, run the following code in a cell:
Director.name_exclaimed = _.name + "!"
In this example, the underscore acts as a reference to the Director feature class.
We have now created a new feature in our notebook and can query for it in either an online or offline context.
To query for your new features, pass them as outputs to the ChalkClient query methods. For instance, to run an online query for your new Director.name_exclaimed feature, you can run the following:
new_features = client.query(
    input={
        Director.id: 113,
    },
    output={
        Director.name,
        Director.name_exclaimed,
    },
)
new_features
Running an offline query is equally straightforward:
chalk_dataset = client.offline_query(
    input={
        Director.id: list(range(1000)),
    },
    output={
        Director.id,
        Director.name,
        Director.name_exclaimed,
    },
    recompute_features=True,
    run_asynchronously=True,
    dataset_name="<dataset_name>",
)
df = chalk_dataset.to_pandas()
Running the offline query above creates a Pandas DataFrame with the columns Director.id, Director.name, and Director.name_exclaimed (alongside some metadata columns). This DataFrame will have one row for every Director with an id between 0 and 999.
By providing a dataset_name to offline_query, we’ve created a named dataset. Named datasets can easily be shared with other team members.
Any historical dataset can be accessed using the ChalkClient. Named datasets can be accessed as follows:
client = ChalkClient()
dataset = client.get_dataset(name="<dataset_name>")
You can use expressions to build out more complex features. Generally, these more complex features are built from has-one relationships, has-many relationships, and Chalk functions.
In the schema we’ve defined above, a Movie is in a has-one relationship with its Director (even if this is not necessarily true in practice).
When building new features on the Movie class, features from joined classes can be directly accessed. For instance, let’s say we want to define Movie.is_latest_release.
We can accomplish this by defining a feature that reaches through the has-one join:
Movie.is_latest_release = _.director.num_movies == _.release_number
The expression above defines a boolean feature which tests whether the release number for a movie (_.release_number) is equal to the number of movies a director has released (_.director.num_movies).
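To make the row-level meaning concrete, the comparison can be sketched in plain Python. This is only an illustration of what the expression evaluates for a single movie, not how Chalk executes it. MovieRow and DirectorRow are hypothetical stand-ins for the feature classes, named differently so they do not shadow the Director and Movie classes loaded into the notebook.

```python
from dataclasses import dataclass


@dataclass
class DirectorRow:
    id: str
    num_movies: int


@dataclass
class MovieRow:
    id: str
    release_number: int
    director: DirectorRow  # the joined (has-one) director


def is_latest_release(movie: MovieRow) -> bool:
    # Mirrors `_.director.num_movies == _.release_number` for one row.
    return movie.director.num_movies == movie.release_number


director = DirectorRow(id="d1", num_movies=3)
first = MovieRow(id="m1", release_number=1, director=director)
third = MovieRow(id="m3", release_number=3, director=director)
print(is_latest_release(first), is_latest_release(third))  # False True
```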
In the schema we’ve defined above, a Director is in a has-many relationship with their Movies.
When building new features on the Director class, features from joined classes can be directly accessed. For instance, let’s say we want to define Director.average_movie_runtime. We can accomplish this by defining an aggregation feature:
Director.average_movie_runtime = _.movies[_.runtime_m].mean()
In the above example, the two different _’s refer to two different feature class namespaces:
_.movies can be thought of as Director.movies,
_.runtime_m can be thought of as Movie.runtime_m.
The expression above gets a director’s movies (_.movies), specifies that the aggregation is happening on the runtime_m field of those movies (_.movies[_.runtime_m]) and then calculates the mean (.mean()).
Filters can also be passed to these expressions. For instance, say we want to filter out any movie that has a primary or secondary genre of short from the average runtime. This can be done as follows:
Director.average_movie_runtime_filtered = _.movies[
    _.runtime_m,  # aggregation column
    _.primary_genre != "short",  # filter
    _.secondary_genre != "short",  # additional filter
].mean()
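As a sanity check on what the plain and filtered aggregations should return, the same calculation can be sketched in plain Python over a director's movies. This is an illustration under stated assumptions rather than Chalk's implementation: it assumes that both filter conditions must hold for a movie to be kept, and that an empty selection yields None. MovieRow and average_runtime are hypothetical names, and the field names follow the schema printed earlier.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class MovieRow:
    runtime_m: float
    primary_genre: str
    secondary_genre: str


def average_runtime(
    movies: List[MovieRow], exclude_genre: Optional[str] = None
) -> Optional[float]:
    """Mean runtime, optionally skipping movies tagged with exclude_genre."""
    kept = [
        m.runtime_m
        for m in movies
        if exclude_genre is None
        or (m.primary_genre != exclude_genre and m.secondary_genre != exclude_genre)
    ]
    # Assumption: an empty selection yields None rather than raising.
    return mean(kept) if kept else None


movies = [
    MovieRow(120.0, "drama", "crime"),
    MovieRow(100.0, "comedy", "drama"),
    MovieRow(20.0, "short", "drama"),
]
print(average_runtime(movies))           # 80.0
print(average_runtime(movies, "short"))  # 110.0
```

Comparing a hand-computed value like this against a query result for one known director is a quick way to confirm that a filter does what you intended.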
Chalk exposes a number of functions that can be called and composed as part of expressions. A full list of these functions can be found on our expressions page.
Below, we provide a few examples of how these functions can be used to create new features. The examples refer to the functions module as F, which assumes an import such as import chalk.functions as F:
Identify whether a movie belongs to a Director’s typical genres:
Movie.is_directors_usual_genre = F.contains(_.director.known_for_genres, _.primary_genre)
Identify whether a movie was released on a US federal holiday:
Movie.released_on_holiday = F.is_us_federal_holiday(_.release_date)
Get the release_date of a Director’s latest movie:
Director.latest_movie_release = F.max_by(_.movies[_.release_date], _.release_date)
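For the last example, the intended result (the release date of the most recently released movie) can be sketched in plain Python. This only illustrates the expected value, not how F.max_by is evaluated. MovieRow and latest_release_date are hypothetical names, and returning None for a director with no movies is an assumption.

```python
import datetime as dt
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MovieRow:
    title: str
    release_date: dt.datetime


def latest_release_date(movies: List[MovieRow]) -> Optional[dt.datetime]:
    """Return the release_date of the movie with the greatest release_date."""
    if not movies:
        # Assumption: a director with no movies has no latest release.
        return None
    newest = max(movies, key=lambda movie: movie.release_date)
    return newest.release_date


movies = [
    MovieRow("First", dt.datetime(1999, 5, 1)),
    MovieRow("Second", dt.datetime(2004, 11, 12)),
]
print(latest_release_date(movies))  # 2004-11-12 00:00:00
```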
The features you’ve defined in your notebook can also be composed—they can be used to define new features.
For instance, you can define the Director.makes_movies_that_are_too_long feature:
Director.makes_movies_that_are_too_long = _.average_movie_runtime > 150
Defining features as expressions provides a couple of significant advantages: