Overview
How it all fits together.
Chalk integrates with your data sources, transforms data with feature pipelines, stores this data in online and offline storage, and provides monitoring on feature computation and distributions.
By default, Chalk operates as a SaaS offering, where Chalk executes pipelines and stores data on its cloud infrastructure. However, Chalk can also deploy its software into your AWS account or GCP project. In this model, environments are provisioned via Terraform and managed by the Chalk team.
There are a few main components of a Chalk deployment:
These components are organized as follows:
Let’s explore Chalk’s architecture by examining how the pieces work together to compute and serve features online. Suppose that you want to compute a set of features for making a decision about a request from a user:
Chalk’s online query serving platform is designed to fetch data from a variety of heterogeneous data sources and execute complex transformations on that data with the minimum possible latency. Chalk uses many techniques to reduce latency, such as:
Chalk’s architecture also supports efficient batch point-in-time queries to construct model training sets or perform batch offline inference.
Chalk’s Offline Storage is optimized for batch querying of temporally consistent data. Chalk uses columnar storage backends (like TimescaleDB or BigQuery) to ingest massive amounts of data from a variety of data sources and query it efficiently. Note that data ingested into the Offline Store can be trivially made available for use in an online querying context using Reverse ETL.
You can see a list of supported ingestion sources in the Integrations section of these docs.
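To make the idea of a point-in-time query concrete, here is a minimal sketch in plain Python. It is not Chalk's API: the function names (`build_history`, `value_as_of`, `training_set`), the tuple layouts, and the integer timestamps are all assumptions chosen for illustration. For each labeled example, it picks the most recent feature value observed at or before the label's timestamp, so no future data leaks into the training set.

```python
from bisect import bisect_right
from collections import defaultdict


def build_history(observations):
    """Group (entity_id, feature, observed_at, value) rows into sorted timelines."""
    history = defaultdict(list)
    for entity_id, feature, observed_at, value in observations:
        history[(entity_id, feature)].append((observed_at, value))
    for timeline in history.values():
        timeline.sort(key=lambda row: row[0])  # order each timeline by timestamp
    return history


def value_as_of(history, entity_id, feature, as_of):
    """Return the latest value observed at or before `as_of`, or None."""
    timeline = history.get((entity_id, feature), [])
    timestamps = [ts for ts, _ in timeline]
    idx = bisect_right(timestamps, as_of)  # count of observations with ts <= as_of
    return timeline[idx - 1][1] if idx else None


def training_set(history, labels, features):
    """Join each (entity_id, label_time, label) row to its as-of feature values."""
    rows = []
    for entity_id, label_time, label in labels:
        row = {"entity_id": entity_id, "label": label}
        for feature in features:
            row[feature] = value_as_of(history, entity_id, feature, label_time)
        rows.append(row)
    return rows


# Hypothetical data: one user's balance observed at three points in time.
observations = [
    ("u1", "balance", 10, 100.0),
    ("u1", "balance", 20, 250.0),
    ("u1", "balance", 30, 50.0),
]
history = build_history(observations)
rows = training_set(history, labels=[("u1", 25, 1)], features=["balance"])
```

In this example the label at time 25 is joined to the balance observed at time 20, not the later observation at time 30. A real offline store would run the equivalent join inside the storage backend rather than in application memory.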
Chalk uses different storage technologies to support online and offline use cases.
The online store is optimized for serving the latest version of any given feature for any given entity with the minimum possible latency. Behind the scenes, Chalk uses key-value stores for this purpose. Chalk can be configured to use Redis or Cloud Memorystore for smaller resident data sets with strict low-latency requirements, or DynamoDB when horizontal scalability is required.
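The access pattern described above can be sketched with an in-memory dictionary standing in for the key-value backend. This is an illustration only, not Chalk's implementation: the `OnlineStore` class, its method names, and the rule that an older write never overwrites a newer one are assumptions made for the example.

```python
class OnlineStore:
    """Toy key-value store that keeps only the latest value per (entity, feature)."""

    def __init__(self):
        # (entity_id, feature) -> (observed_at, value)
        self._latest = {}

    def write(self, entity_id, feature, value, observed_at):
        """Store the value unless a newer observation is already present."""
        key = (entity_id, feature)
        current = self._latest.get(key)
        if current is None or observed_at >= current[0]:
            self._latest[key] = (observed_at, value)

    def read(self, entity_id, features):
        """Fetch the latest value of each requested feature; None if missing."""
        result = {}
        for feature in features:
            entry = self._latest.get((entity_id, feature))
            result[feature] = entry[1] if entry is not None else None
        return result


store = OnlineStore()
store.write("u1", "balance", 100.0, observed_at=10)
store.write("u1", "balance", 250.0, observed_at=20)
store.write("u1", "balance", 75.0, observed_at=15)  # stale write, ignored
latest = store.read("u1", ["balance", "age"])
```

Each read is a direct key lookup, which is why key-value stores suit this workload: serving a feature never requires scanning its history.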
The offline store is optimized for storing all historical feature values, serving point-in-time correct queries, and tracking provenance of features. Chalk supports a variety of storage backends depending on data scale and latency requirements. Typically, Chalk uses TimescaleDB or BigQuery.
Chalk supports robust monitoring not only of pipeline execution, but also of the feature values themselves. Monitoring machine learning data infrastructure is just as important as monitoring application availability, but is often overlooked.
Each time a query is served, Chalk assigns a unique “trace id” to the request. Chalk tracks all emitted logs on both a per-resolver basis and a per-trace basis. This enables you to debug problems and track fine-grained performance metrics pertaining to specific features and resolvers. These logs help answer common questions such as how often certain features are computed and how long the computation takes.
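As a rough illustration of indexing logs by both trace and resolver, consider the following sketch. It does not reflect Chalk's internal logging: the `TraceLog` class, the record fields, and the resolver names are hypothetical, and the trace id is simply a random UUID.

```python
import time
import uuid
from collections import defaultdict


class TraceLog:
    """Toy log that indexes each resolver execution by trace id and by resolver."""

    def __init__(self):
        self.by_trace = defaultdict(list)
        self.by_resolver = defaultdict(list)

    def new_trace(self):
        """Create a unique id for one served query."""
        return uuid.uuid4().hex

    def run(self, trace_id, resolver_name, fn, *args):
        """Execute a resolver function and record how long it took."""
        start = time.perf_counter()
        result = fn(*args)
        record = {
            "trace_id": trace_id,
            "resolver": resolver_name,
            "duration_s": time.perf_counter() - start,
        }
        self.by_trace[trace_id].append(record)
        self.by_resolver[resolver_name].append(record)
        return result

    def stats(self, resolver_name):
        """Report how often a resolver ran and its mean duration."""
        records = self.by_resolver.get(resolver_name, [])
        calls = len(records)
        total = sum(r["duration_s"] for r in records)
        return {"calls": calls, "mean_duration_s": total / calls if calls else 0.0}


log = TraceLog()
first = log.new_trace()
second = log.new_trace()
log.run(first, "get_balance", lambda user_id: 100.0, "u1")
log.run(first, "get_age", lambda user_id: 30, "u1")
log.run(second, "get_balance", lambda user_id: 250.0, "u2")
```

Looking up `log.by_trace[first]` shows everything that happened while serving one request, while `log.stats("get_balance")` summarizes one resolver across all requests.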
Like many traditional application monitoring platforms, Chalk supports alerting on performance or availability issues via integrations with PagerDuty and Slack.
In addition to performance and request metrics for computation, Chalk supports alerting on feature values themselves. You can specify a variety of threshold requirements and drift tolerance tests to help spot issues such as: