Chalk home page
Docs
API
CLI
  1. Overview
  2. Platform Architecture

Service Architecture

Chalk offers a hosted model (“Chalk Cloud”) and a customer-hosted model (“Customer Cloud”). Most companies choose to run Chalk in their own cloud using the Customer Cloud model. This page discusses the Customer Cloud deployment of Chalk on AWS and GCP.

A Chalk deployment consists of a Management Plane and a Data Plane:

Architecture diagram
  1. 1Creating secrets:The API server can be configured to have write-only access to the secret store.
  2. 2Reading secrets:Secret access can be restricted entirely to the data plane.
  3. 3Online store:Chalk supports several online feature stores, which are used for caching feature values. On AWS, Chalk supports DynamoDB, Elasticache, and PostgreSQL.

Management Plane

The Management Plane is responsible for storing and serving non-customer data (like alert and RBAC configurations). It orchestrates machines in the Data Plane using the Kubernetes API to do things like scaling deployments, and running batch jobs.

The Management Plane does not have access to your data. Most Chalk partners using the Customer Cloud deployment choose to have Chalk host the Metadata Plane, but in some especially sensitive applications (like FedRamp deployments), Chalk partners will also opt to host the Metadata Plane.

Data Plane

The Data Plane consists of the machines computing feature pipelines, online store, and offline store. The compute nodes run on Kubernetes (typically EKS on AWS and GKE on GCP.)

A single Data Plane can run many Chalk Environments. Often, companies will have 2-3 environments (like qa, stage, and prod.) If running in a single data plane, these environments share resources, which helps with cost and ease of setup.

However, if you prefer to have stronger isolation between Chalk Environments, each Chalk Environment can run in a separate Data Plane. You would typically run only one Metadata Plane to orchestrate all Data Planes, and deploy the Metadata Plane to the most sensitive of the environments.


Deploying a Chalk instance

Chalk is deployed with Terraform and uses common cloud primitives. Given all necessary permissions, Chalk can be deployed in about an hour. You can see sample Terraform for AWS here.

Our goal is to make deployments fit with your existing infrastructure. If you have custom needs, we are happy to customize the deployment to fit with your service architecture.


Online Computation and Serving

Architecture diagram

Let’s explore Chalk’s architecture by examining how the pieces work together to compute & serve Features online. Suppose that you want to compute a set of features for making a decision about a request from a user:

  1. Your application sends an HTTP request to Chalk’s serving API, specifying information about which features you would like to compute.
  2. Chalk’s Query Planner creates an optimized “plan” to understand which data must be fetched or transformed in order to answer your request.
  3. Chalk’s Compute Engine executes this query plan, either by fetching data directly from your data sources and running transformations on that data, or by pulling cached feature values from Online Storage.
  4. Chalk responds to your application with the features you requested.
  5. Chalk logs any newly computed features to Online and/or Offline Storage for future use.

Chalk’s online query serving platform is designed to fetch data from a variety of heterogeneous data sources and execute complex transformations on that data with the minimum possible latency. Chalk uses many techniques to reduce latency, such as:

  • Automatic parallel execution of concurrent pipeline stages
  • Vectorization of pipeline stages that are written using scalar syntax
  • Low-latency key/value stores (like Redis, BigTable, or DynamoDB) to minimize cached feature fetch time
  • Statistics-informed join planning
  • JIT transformation of Python code into native code

Offline Computation and Serving

Chalk’s architecture also supports efficient batch point-in-time queries to construct model training sets or perform batch offline inference.

  1. You submit a training data request from a notebook client like Jupyter using Chalk’s client library.
  2. Chalk’s Query Planner creates a plan to determine which data can be served from Offline Storage, and which data must be computed fresh (i.e. in the case of lazy backfills).
  3. Chalk’s Compute Engine pulls point-in-time correct feature values from Offline Storage.
  4. Chalk returns a dataframe of features to you.

Chalk’s Offline Storage is optimized for batch querying of temporally consistent data. Chalk uses columnar storage backends (Snowflake, Delta Lake, or BigQuery) to ingest massive amounts of data from a variety of data sources and query it efficiently. Note that data ingested into the Offline Store can be trivially made available for use in an online querying context using Reverse ETL.

You can see a list of supported ingestion sources in the Integrations section of these docs.


Storage

Chalk uses different storage technologies to support online and offline use cases.

The online store is optimized for serving the latest version of any given feature for any given entity with the minimum possible latency. Behind the scenes, Chalk uses key-value stores for this purpose. Chalk can be configured to use Redis or Cloud Memory Store for smaller resident data sets with high latency requirements, or DynamoDB when horizontal scalability is required.

The offline store is optimized for storing all historical feature values, serving point-in-time correct queries, and tracking provenance of features. Chalk supports a variety of storage backends depending on data scale and latency requirements. Typically, Chalk uses Snowflake, Delta Lake, or BigQuery.


Monitoring

Chalk supports not only robust monitoring of pipeline execution, but of the feature values themselves as well. Monitoring machine learning data infrastructure is just as important as monitoring application availability, but is often overlooked.

Each time a query is served, Chalk assigns a unique “trace id” for the request. Chalk tracks all emitted logs on both a per-resolver basis and a per-trace basis. This enables you to debug problems and track fine-grained performance metrics pertaining to specific features and resolvers. Leveraging this helps answer common questions such as how often certain features are computed and how long the computation takes.

Like many traditional application monitoring platforms, Chalk supports alerting on performance or availability issues via integrations with PagerDuty and Slack.

In addition to performance and request metrics for computation, Chalk supports alerting on feature values themselves. You can specify a variety of threshold requirements and drift tolerance tests to help spot issues such as:

  • Feature drift
  • Mismatches between offline/online pipelines
  • Format changes with external data sources