Chalk supports multi-region failover on AWS so that your feature serving infrastructure remains available even if an entire region goes down. This guide covers how to set up active-passive failover using multiple EKS clusters, resource groups, and Route 53 failover routing.


Architecture overview

Multi-region failover works by deploying Chalk into EKS clusters in two or more AWS regions. Each region runs an independent set of Chalk services behind its own ALB. Route 53 failover routing directs traffic to the healthy region based on health checks against each ALB.

The key components are:

  1. Multiple EKS clusters — one per region, each running a full Chalk data plane.
  2. Resource groups — each region’s Chalk deployment is assigned to a distinct resource group with a unique hostname.
  3. Route 53 failover routing — a single DNS name resolves to the healthy region’s ALB, with automatic failover when health checks fail.

Setting up multi-region failover

1. Create EKS clusters in each region

Work with the Chalk team to provision EKS clusters in your target regions (for example, us-east-1 and us-west-2). Each cluster runs a complete Chalk data plane with its own query servers, online store, and background persistence workers.

2. Configure resource groups

Each regional deployment is assigned to its own resource group. Resource groups provide workload isolation and give each region a unique hostname for its query servers.

For example, you might have:

| Region | Resource group | Hostname |
| --- | --- | --- |
| us-east-1 | prod-east | prod-east.query.example.com |
| us-west-2 | prod-west | prod-west.query.example.com |

Each resource group’s query server exposes a /healthz endpoint that the ALB uses for health checking. When the query server is healthy, /healthz returns a 200 response.
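You can verify a region's health endpoint the same way the load balancer does. The helper below is a hypothetical monitoring snippet (not part of the Chalk SDK), using only the Python standard library:

```python
# Hypothetical helper (not part of the Chalk SDK): probe a resource
# group's /healthz endpoint the same way the ALB health check does.
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/healthz responds with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or non-2xx status.
        return False


# e.g. is_healthy("https://prod-east.query.example.com")
```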

3. Configure Route 53 failover routing

Create a Route 53 hosted zone with failover routing records that point to the ALB in each region:

  • Primary record — points to the ALB in your primary region (e.g. us-east-1), with a health check against the /healthz endpoint.
  • Secondary record — points to the ALB in your secondary region (e.g. us-west-2), with its own health check.

Both records share the same DNS name (e.g. chalk-failover.example.com). Route 53 resolves this name to the primary ALB when it is healthy, and automatically switches to the secondary ALB when the primary’s health check fails.

Route 53 health checks poll the `/healthz` endpoint on each ALB. The default health check interval is 30 seconds. You can configure the failure threshold (number of consecutive failures before failover) and request interval to tune failover speed vs. sensitivity.
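The pair of failover records can be expressed as a `ChangeBatch` payload for the Route 53 API. The sketch below builds that payload as a plain dict (hostnames, health check IDs, and record values are illustrative placeholders); with boto3 you would pass it to `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=...)`:

```python
# Sketch of the primary/secondary failover record pair as a Route 53
# ChangeBatch. All names and IDs below are illustrative placeholders.
def failover_change_batch(name, primary_alb, secondary_alb,
                          primary_hc_id, secondary_hc_id, ttl=60):
    def record(set_id, role, target, hc_id):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "TTL": ttl,  # keep low so clients re-resolve quickly after failover
                "SetIdentifier": set_id,
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "HealthCheckId": hc_id,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Changes": [
            record("primary", "PRIMARY", primary_alb, primary_hc_id),
            record("secondary", "SECONDARY", secondary_alb, secondary_hc_id),
        ]
    }


batch = failover_change_batch(
    "chalk-failover.example.com",
    "prod-east-alb.us-east-1.elb.amazonaws.com",
    "prod-west-alb.us-west-2.elb.amazonaws.com",
    "hc-east-id",
    "hc-west-id",
)
```

For ALBs you could also use Route 53 alias records (with `EvaluateTargetHealth`) instead of CNAMEs; the CNAME form is shown here because it makes the TTL explicit.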

4. Point clients at the failover hostname

Configure your application to send queries through the failover DNS name using the `query_server` parameter on `ChalkClient`:

```python
from chalk.client import ChalkClient

client = ChalkClient(
    query_server="https://chalk-failover.example.com",
)

result = client.query(
    input={"user.id": 123},
    output=["user.name", "user.risk_score"],
)
```

The client does not need any awareness of which region is active. DNS resolution handles routing transparently.


How failover works

Under normal operation, Route 53 resolves the failover hostname to the primary region’s ALB. All client traffic flows to the primary EKS cluster.

When the primary region becomes unhealthy:

  1. The Route 53 health check against the primary region’s /healthz endpoint begins failing.
  2. Route 53 detects the configured number of consecutive failures and marks the primary record unhealthy (typically within 1-2 minutes).
  3. Route 53 updates DNS to resolve the failover hostname to the secondary region’s ALB.
  4. Client traffic is routed to the secondary EKS cluster.

When the primary region recovers:

  1. The /healthz endpoint starts returning 200 again.
  2. Route 53 detects the recovery and switches DNS back to the primary region.

> **Warning:** Failover is DNS-based, so clients must respect DNS TTLs. Set a low TTL (e.g. 60 seconds) on the failover record to minimize the window during which clients may still resolve to the unhealthy region.

Data replication

Each region’s Chalk deployment operates independently, but the underlying data stores must be replicated across regions so the passive region can serve traffic with current feature values.

Online store -- DynamoDB Global Tables

Chalk uses DynamoDB Global Tables to replicate online feature values across regions. Global Tables provide automatic, asynchronous replication so that the passive region’s online store stays up to date with writes from the active region. No application-level changes are required — both regions read and write to the same logical table.
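Adding a replica region to an existing DynamoDB table (making it a Global Table) is a single `UpdateTable` call. The sketch below builds the parameters as a dict; the table and region names are placeholders, and with boto3 you would pass them to `dynamodb.update_table(**params)`:

```python
# Illustrative UpdateTable parameters for adding a replica region to an
# existing table (Global Tables version 2019.11.21). The table name is
# a placeholder, not Chalk's actual table name.
def add_replica_params(table_name: str, replica_region: str) -> dict:
    return {
        "TableName": table_name,
        "ReplicaUpdates": [
            {"Create": {"RegionName": replica_region}},
        ],
    }


params = add_replica_params("chalk-online-store", "us-west-2")
```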

Persistence coordination -- MSK Replicator

Chalk uses Amazon MSK (Managed Streaming for Apache Kafka) to coordinate background persistence operations such as writing computed features to the online and offline stores. In a multi-region deployment, MSK must be configured for cross-region replication using MSK Replicator.

MSK Replicator continuously replicates topics from the active region’s MSK cluster to the passive region’s MSK cluster. This ensures that persistence operations initiated by the active region — such as online store writes and offline store ingestion — are also applied in the passive region. When failover occurs, the passive region’s persistence workers pick up from the replicated topic offsets.

Metadata plane -- RDS Aurora

The Chalk metadata plane uses Amazon Aurora as its sole state store. Because Aurora is the only stateful component, multi-region metadata plane availability maps directly to Aurora’s cross-region capabilities:

  • Active-passive — use an Aurora Global Database with a read replica in the secondary region. Aurora handles asynchronous replication and supports managed failover to promote the secondary cluster when the primary region is unavailable.
  • Active-active — use Aurora Global Database write forwarding or an Aurora multi-master configuration so that both regions can accept writes.

The choice between active-passive and active-active for the metadata plane should match the routing strategy you choose for the data plane (see Active-active vs. active-passive below).

Offline store

Offline stores are typically regional. Training data queries and dataset generation should target a specific region rather than the failover hostname.

Deployments

chalk apply should target both regions to keep feature definitions and resolver code in sync. You can automate this in CI/CD by deploying to each resource group.
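A CI step for this might look like the following sketch. The environment names and the CLI flag for targeting a specific deployment are assumptions here, not confirmed Chalk CLI behavior; check `chalk apply --help` for your CLI version:

```python
# Sketch of a CI deploy step that applies the same Chalk project to
# each region's deployment in turn. The "--environment" flag and the
# environment names are assumptions -- verify against your Chalk CLI.
import subprocess

ENVIRONMENTS = ["prod-east", "prod-west"]  # illustrative, one per resource group


def apply_command(environment: str) -> list[str]:
    # Assumed flag name; confirm with `chalk apply --help`.
    return ["chalk", "apply", "--environment", environment]


def deploy_all() -> None:
    for env in ENVIRONMENTS:
        # check=True fails the CI job if any regional deploy fails.
        subprocess.run(apply_command(env), check=True)
```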


Recovery point objective (RPO)

RPO — the maximum acceptable amount of data loss measured in time — depends on the failure scenario:

| Failure scenario | RPO | Notes |
| --- | --- | --- |
| Single AZ failure | 0 | EKS and DynamoDB span multiple AZs within a region. No data is lost. |
| Full region failure (recoverable) | Replication lag | Determined by DynamoDB Global Tables replication lag and MSK Replicator lag, typically seconds. Once the primary region recovers, any in-flight writes that had not yet replicated are recovered from the primary. |
| Full region failure (permanent loss) | Replication lag | If the primary region is permanently lost, any writes that had not yet replicated to the passive region are lost. RPO equals the replication lag at the time of failure. |

You can monitor DynamoDB Global Tables replication lag via the `ReplicationLatency` CloudWatch metric, and MSK Replicator lag via the `ReplicationLatency` metric on your replicator. Under normal conditions both are single-digit seconds.
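A lag query against CloudWatch can be sketched as a `GetMetricStatistics` parameter dict (the table name is a placeholder); with boto3 you would pass it to `cloudwatch.get_metric_statistics(**params)`:

```python
# Illustrative CloudWatch query for DynamoDB Global Tables replication
# lag over the last 15 minutes. The table name is a placeholder.
from datetime import datetime, timedelta, timezone


def replication_latency_params(table_name: str, receiving_region: str,
                               minutes: int = 15) -> dict:
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ReplicationLatency",
        "Dimensions": [
            {"Name": "TableName", "Value": table_name},
            {"Name": "ReceivingRegion", "Value": receiving_region},
        ],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }


params = replication_latency_params("chalk-online-store", "us-west-2")
```

Alerting when `Maximum` exceeds your RPO budget gives early warning that a failover would lose more data than expected.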


Active-active vs. active-passive

The architecture described above is active-passive: only one region serves live traffic at a time. Chalk also supports active-active configurations using Route 53 weighted or latency-based routing, where both regions serve traffic simultaneously. Contact the Chalk team to discuss which approach is best for your availability and latency requirements.