Multi-Region Failover

Chalk supports multi-region failover on AWS so that your feature serving infrastructure remains available even if an entire region goes down. This guide covers how to set up active-passive failover using multiple EKS clusters, resource groups, and Route 53 failover routing.

Architecture overview

Multi-region failover works by deploying Chalk into EKS clusters in two or more AWS regions. Each region runs an independent set of Chalk services behind its own ALB. Route 53 failover routing directs traffic to the healthy region based on health checks against each ALB.

The key components are:

Multiple EKS clusters — one per region, each running a full Chalk data plane.
Resource groups — each region’s Chalk deployment is assigned to a distinct resource group with a unique hostname.
Route 53 failover routing — a single DNS name resolves to the healthy region’s ALB, with automatic failover when health checks fail.

Setting up multi-region failover

1. Create EKS clusters in each region

Work with the Chalk team to provision EKS clusters in your target regions (for example, us-east-1 and us-west-2). Each cluster runs a complete Chalk data plane with its own query servers, online store, and background persistence workers.

2. Configure resource groups

Each regional deployment is assigned to its own resource group. Resource groups provide workload isolation and give each region a unique hostname for its query servers.

For example, you might have:

Region	Resource group	Hostname
`us-east-1`	`prod-east`	`prod-east.query.example.com`
`us-west-2`	`prod-west`	`prod-west.query.example.com`

Each resource group’s query server exposes a `/healthz` endpoint that the ALB uses for health checking. When the query server is healthy, `/healthz` returns a 200 response.
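As a rough illustration of the health contract described above, the following sketch probes a regional hostname and treats anything other than a 200 from `/healthz` as unhealthy. The helper name `is_healthy` and its injectable `opener` argument are assumptions made for this example; they are not part of Chalk or of the ALB health checker.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True only if GET <base_url>/healthz answers with HTTP 200.

    `opener` is injectable so the check can be exercised without a network;
    it defaults to the standard library's urlopen.
    """
    url = base_url.rstrip("/") + "/healthz"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTP error responses, refused connections, and timeouts
        # are all treated as "unhealthy".
        return False
```

Calling `is_healthy("https://prod-east.query.example.com")` would probe the example hostname from the table above.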

3. Configure Route 53 failover routing

Create a Route 53 hosted zone with failover routing records that point to the ALB in each region:

Primary record — points to the ALB in your primary region (e.g. us-east-1), with a health check against the /healthz endpoint.
Secondary record — points to the ALB in your secondary region (e.g. us-west-2), with its own health check.

Both records share the same DNS name (e.g. chalk-failover.example.com). Route 53 resolves this name to the primary ALB when it is healthy, and automatically switches to the secondary ALB when the primary’s health check fails.

Route 53 health checks poll the `/healthz` endpoint on each ALB. The standard request interval is 30 seconds; a fast interval of 10 seconds is also available. You can configure the failure threshold (the number of consecutive failures before the endpoint is marked unhealthy, 3 by default) and the request interval to trade failover speed against sensitivity to transient errors.
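The two failover records can be sketched as a Route 53 change batch. This is a minimal sketch under stated assumptions: the ALB DNS names and health check IDs are hypothetical placeholders, and CNAME records are used so that the TTL can be set explicitly (alias records to an ALB do not take a TTL of their own). The functions only build plain dictionaries; nothing is sent to AWS.

```python
def failover_record(name, role, target_dns, health_check_id, ttl=60):
    """Build one UPSERT change for a Route 53 failover record set."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"role must be PRIMARY or SECONDARY, got {role!r}")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            # SetIdentifier distinguishes records that share a name and type.
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,
            "TTL": ttl,
            "ResourceRecords": [{"Value": target_dns}],
            "HealthCheckId": health_check_id,
        },
    }


def failover_change_batch(name, primary, secondary):
    """`primary` and `secondary` are (alb_dns_name, health_check_id) pairs."""
    return {
        "Comment": "Active-passive failover records",
        "Changes": [
            failover_record(name, "PRIMARY", *primary),
            failover_record(name, "SECONDARY", *secondary),
        ],
    }


# Hypothetical ALB names and health check IDs, for illustration only.
batch = failover_change_batch(
    "chalk-failover.example.com",
    primary=("east-alb.example.com", "hc-east-placeholder"),
    secondary=("west-alb.example.com", "hc-west-placeholder"),
)
```

With boto3 installed and credentials configured, a batch like this could be passed as the `ChangeBatch` argument of `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`.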

4. Point clients at the failover hostname

Configure your application to send queries through the failover DNS name using the `query_server` parameter on `ChalkClient`:

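from chalk.client import ChalkClient

# Send queries to the Route 53 failover hostname rather than a regional one.
client = ChalkClient(
    query_server="https://chalk-failover.example.com",
)

# The same call works whichever region currently serves the hostname.
result = client.query(
    input={"user.id": 123},
    output=["user.name", "user.risk_score"],
)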

The client does not need any awareness of which region is active. DNS resolution handles routing transparently.

How failover works

Under normal operation, Route 53 resolves the failover hostname to the primary region’s ALB. All client traffic flows to the primary EKS cluster.

When the primary region becomes unhealthy:

The ALB health check against /healthz begins failing.
Route 53 detects consecutive health check failures (typically within 1-2 minutes).
Route 53 updates DNS to resolve the failover hostname to the secondary region’s ALB.
Client traffic is routed to the secondary EKS cluster.

When the primary region recovers:

The /healthz endpoint starts returning 200 again.
Route 53 detects the recovery and switches DNS back to the primary region.
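The failover and recovery sequences above can be modeled with a small simulation. This is an illustrative sketch, not Route 53's implementation: the class and function names are invented for this example, and it assumes the same consecutive-result threshold applies to both marking an endpoint unhealthy and marking it healthy again.

```python
from dataclasses import dataclass


@dataclass
class EndpointHealth:
    """Tracks one endpoint's status from a stream of health check results."""
    failure_threshold: int = 3
    healthy: bool = True
    _streak: int = 0  # consecutive results that disagree with `healthy`

    def observe(self, check_passed: bool) -> bool:
        if check_passed == self.healthy:
            # A result that agrees with the current status resets the streak.
            self._streak = 0
        else:
            self._streak += 1
            if self._streak >= self.failure_threshold:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


def active_region(primary: EndpointHealth, secondary: EndpointHealth) -> str:
    """Pick the region the failover hostname would resolve to."""
    if primary.healthy:
        return "primary"
    if secondary.healthy:
        return "secondary"
    # Assumption: with both unhealthy, fall back to the primary record.
    return "primary"
```

Feeding three consecutive failed checks into the primary's `EndpointHealth` flips `active_region` to `"secondary"`; three consecutive passes flip it back.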

Warning

Failover is DNS-based, so clients must respect DNS TTLs. Set a low TTL (e.g. 60 seconds) on the failover record to minimize the window during which clients may still resolve to the unhealthy region.
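To see how the health check settings and the TTL combine, the arithmetic can be sketched as follows. This is a simplified upper-bound estimate under stated assumptions: it ignores health checker propagation delay and any client-side DNS caching beyond the TTL, and the function name is invented for this example.

```python
def worst_case_failover_seconds(request_interval_s, failure_threshold, dns_ttl_s):
    """Approximate time until clients stop resolving to the unhealthy region."""
    # Time for enough consecutive failed checks to mark the endpoint unhealthy.
    detection_s = request_interval_s * failure_threshold
    # Clients may keep using a cached answer for up to one full TTL.
    return detection_s + dns_ttl_s


# 30 s interval, threshold of 3, 60 s TTL.
print(worst_case_failover_seconds(30, 3, 60))  # 150
# Same settings with a 10 s interval.
print(worst_case_failover_seconds(10, 3, 60))  # 90
```

Under these assumptions, lowering the request interval shortens detection, while lowering the TTL shortens the period during which cached answers still point at the failed region.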

Data replication

Each region’s Chalk deployment operates independently, but the underlying data stores must be replicated across regions so the passive region can serve traffic with current feature values.

Online store -- DynamoDB Global Tables

Chalk uses DynamoDB Global Tables to replicate online feature values across regions. Global Tables provide automatic, asynchronous replication so that the passive region’s online store stays up to date with writes from the active region. No application-level changes are required — both regions read and write to the same logical table.

Online store -- ElastiCache Valkey Global Datastore

If you use ElastiCache Valkey as your online store instead of DynamoDB, cross-region replication is handled by ElastiCache Global Datastore.

Global Datastore creates a fully managed, active-passive replication topology across regions:

Primary cluster: the ElastiCache Valkey cluster in your primary region accepts all writes. Chalk’s persistence workers in the active region write computed feature values here.
Secondary cluster: a read-only replica cluster in the secondary region receives writes asynchronously from the primary. Replication is continuous and typically completes in under one second under normal network conditions.

When a region failover occurs, the secondary cluster must be promoted to become the new primary before it can accept writes. ElastiCache supports managed promotion through the AWS console or API, and the process typically completes within minutes.

How async replication works

ElastiCache Global Datastore uses the Valkey replication stream to replicate data cross-region:

Writes land on the primary cluster’s primary node as normal Valkey commands.
The replication stream is forwarded over an AWS-managed, encrypted cross-region link to the secondary cluster.
The secondary cluster applies the replication stream to its own dataset, keeping it eventually consistent with the primary.

Because replication is asynchronous, there is a small window (typically under one second cross-region, and shorter for in-region replicas) during which the secondary may lag behind the primary. In a failover scenario, any writes that had not yet been replicated to the secondary at the moment of failure may be lost.

Monitor cross-region replication health with the `GlobalDatastoreReplicationLag` CloudWatch metric on your Global Datastore; the separate `ReplicationLag` metric covers in-region replicas. Under normal conditions, cross-region lag typically stays under one second.

Failover behavior

During a region failover:

Route 53 detects the primary region’s health check failure and switches DNS to the secondary region.
You (or an automated runbook) promote the secondary ElastiCache cluster to primary using the Global Datastore failover API.
Chalk’s query servers in the secondary region begin reading from the now-promoted cluster. Persistence workers in the secondary region begin writing new feature values to it.

Warning

Unlike DynamoDB Global Tables, ElastiCache Global Datastore requires an explicit promotion step during failover. Until the secondary cluster is promoted, it remains read-only. Plan for this in your failover runbook or automation.
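The promotion step could be automated along these lines. This is a hedged sketch rather than a Chalk-provided runbook: the wrapper name and all identifiers passed to it are hypothetical, and the client is expected to be a boto3 ElastiCache client, whose `failover_global_replication_group` operation takes the three keyword arguments shown.

```python
def promote_secondary(elasticache_client, global_datastore_id,
                      secondary_region, secondary_replication_group_id):
    """Ask ElastiCache to make the secondary region's cluster the new primary.

    `elasticache_client` is expected to behave like
    boto3.client("elasticache", region_name=secondary_region).
    """
    return elasticache_client.failover_global_replication_group(
        GlobalReplicationGroupId=global_datastore_id,
        # The region and replication group that should BECOME primary.
        PrimaryRegion=secondary_region,
        PrimaryReplicationGroupId=secondary_replication_group_id,
    )


# Example call (all identifiers are placeholders):
# import boto3
# client = boto3.client("elasticache", region_name="us-west-2")
# promote_secondary(client, "example-global-datastore", "us-west-2", "chalk-west")
```

Passing the client in as an argument keeps the function easy to exercise in a runbook dry run with a stub object before pointing it at a real account.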

Configuration

Work with the Chalk team to provision ElastiCache Global Datastore across your target regions. Each region’s Chalk deployment should be configured with a connection URI pointing to the local ElastiCache cluster endpoint, so reads are always served from the nearest region. See the online store setup guide for connection URI format details.

Persistence coordination -- MSK Replicator

Chalk uses Amazon MSK (Managed Streaming for Apache Kafka) to coordinate background persistence operations such as writing computed features to the online and offline stores. In a multi-region deployment, MSK must be configured for cross-region replication using MSK Replicator.

MSK Replicator continuously replicates topics from the active region’s MSK cluster to the passive region’s MSK cluster. This ensures that persistence operations initiated by the active region — such as online store writes and offline store ingestion — are also applied in the passive region. When failover occurs, the passive region’s persistence workers pick up from the replicated topic offsets.

Metadata plane -- RDS Aurora

The Chalk metadata plane uses Amazon Aurora as its sole state store. Because Aurora is the only stateful component, multi-region metadata plane availability maps directly to Aurora’s cross-region capabilities:

Active-passive: use an Aurora Global Database with a read-only secondary cluster in the secondary region. Aurora handles asynchronous replication and supports managed failover to promote the secondary cluster when the primary region is unavailable.
Active-active: use Aurora Global Database write forwarding so that both regions can accept writes. With write forwarding, the secondary cluster forwards write statements to the primary cluster, so writes still depend on the primary region being reachable.

The choice between active-passive and active-active for the metadata plane should match the routing strategy you choose for the data plane (see Active-active vs. active-passive below).

Offline store

Offline stores are typically regional. Training data queries and dataset generation should target a specific region rather than the failover hostname.
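One way to keep offline workloads pinned to a region is to resolve the regional hostname explicitly instead of using the failover name. The sketch below reuses the example hostnames from the resource group table; the mapping and the helper name are illustrative assumptions, not Chalk configuration.

```python
# Example regional hostnames, copied from the resource group table.
REGIONAL_HOSTNAMES = {
    "us-east-1": "prod-east.query.example.com",
    "us-west-2": "prod-west.query.example.com",
}


def regional_query_server(region):
    """Return the query server URL for one specific region."""
    try:
        host = REGIONAL_HOSTNAMES[region]
    except KeyError:
        raise ValueError(f"no Chalk deployment configured for {region!r}") from None
    return f"https://{host}"


# Usage with the client shown earlier (requires the chalk package):
# from chalk.client import ChalkClient
# offline_client = ChalkClient(query_server=regional_query_server("us-east-1"))
```

Failing loudly on an unknown region avoids silently falling back to the failover hostname for training workloads.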

Deployments

Run `chalk apply` against both regions to keep feature definitions and resolver code in sync. You can automate this in CI/CD by deploying to each resource group.

Recovery point objective (RPO)

RPO — the maximum acceptable amount of data loss measured in time — depends on the failure scenario:

Failure scenario	RPO	Notes
Single AZ failure	0	EKS, DynamoDB, and ElastiCache span multiple AZs within a region. No data is lost.
Full region failure (recoverable)	Replication lag	Determined by your online store’s replication lag (DynamoDB Global Tables or ElastiCache Global Datastore) and MSK Replicator lag, typically seconds. Once the primary region recovers, any in-flight writes that had not yet replicated are recovered from the primary.
Full region failure (permanent loss)	Replication lag	If the primary region is permanently lost, any writes that had not yet replicated to the passive region are lost. RPO equals the replication lag at the time of failure.

You can monitor replication lag via CloudWatch: `ReplicationLatency` for DynamoDB Global Tables, `GlobalDatastoreReplicationLag` for ElastiCache Global Datastore, and `ReplicationLatency` for MSK Replicator. Under normal conditions all are single-digit seconds or less.
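As a rough way to turn observed lag into an RPO estimate, the sketch below takes recent lag samples (in seconds) for each replication path and reports the worst one. Treating the slowest path as the bound is an assumption of this example, as are the function names and the sample values; fetching the samples from CloudWatch is left out.

```python
from typing import Mapping, Sequence


def estimate_rpo_seconds(lag_samples_by_path: Mapping[str, Sequence[float]]) -> float:
    """Return the largest observed replication lag across all paths."""
    worst = 0.0
    for path, samples in lag_samples_by_path.items():
        if not samples:
            raise ValueError(f"no lag samples for {path!r}")
        worst = max(worst, max(samples))
    return worst


def within_rpo_target(lag_samples_by_path, target_seconds: float) -> bool:
    """True if every path's observed lag stays at or below the target."""
    return estimate_rpo_seconds(lag_samples_by_path) <= target_seconds


# Made-up sample values, for illustration only.
samples = {
    "online_store": [0.4, 0.9, 0.6],
    "msk_replicator": [1.5, 2.2, 1.8],
}
print(estimate_rpo_seconds(samples))    # 2.2
print(within_rpo_target(samples, 5.0))  # True
```

A check like this could back an alert that fires when observed lag approaches the RPO you have committed to.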

Active-active vs. active-passive

The architecture described above is active-passive: only one region serves live traffic at a time. Chalk also supports active-active configurations using Route 53 weighted or latency-based routing, where both regions serve traffic simultaneously. Contact the Chalk team to discuss which approach is best for your availability and latency requirements.
