Configure multi-region failover for high availability on AWS.
Chalk supports multi-region failover on AWS so that your feature serving infrastructure remains available even if an entire region goes down. This guide covers how to set up active-passive failover using multiple EKS clusters, resource groups, and Route 53 failover routing.
Multi-region failover works by deploying Chalk into EKS clusters in two or more AWS regions. Each region runs an independent set of Chalk services behind its own ALB. Route 53 failover routing directs traffic to the healthy region based on health checks against each ALB.
The key components are:
Work with the Chalk team to provision EKS clusters in your target regions (for example,
us-east-1 and us-west-2). Each cluster runs a complete Chalk data plane with its
own query servers, online store, and background persistence workers.
Each regional deployment is assigned to its own resource group. Resource groups provide workload isolation and give each region a unique hostname for its query servers.
For example, you might have:
| Region | Resource group | Hostname |
|---|---|---|
| us-east-1 | prod-east | prod-east.query.example.com |
| us-west-2 | prod-west | prod-west.query.example.com |
Each resource group’s query server exposes a `/healthz` endpoint that the ALB uses for
health checking. When the query server is healthy, `/healthz` returns a 200 response.
Create a Route 53 hosted zone with failover routing records that point to the ALB in each region:
- A primary record pointing to the ALB in the primary region (for example, us-east-1), with a health check against the `/healthz` endpoint.
- A secondary record pointing to the ALB in the secondary region (for example, us-west-2), with its own health check.

Both records share the same DNS name (e.g. chalk-failover.example.com). Route 53
resolves this name to the primary ALB when it is healthy, and automatically switches to
the secondary ALB when the primary’s health check fails.
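As a rough illustration of the two failover records, the sketch below builds the `ChangeBatch` payload accepted by the Route 53 `ChangeResourceRecordSets` API. The helper name, the dictionary keys (`alb_dns_name`, `alb_hosted_zone_id`, `health_check_id`), and every ID and hostname are placeholders chosen for this example, not values from a real Chalk deployment.

```python
def failover_change_batch(record_name, primary, secondary):
    """Build a Route 53 ChangeBatch holding a PRIMARY and a SECONDARY alias record.

    `primary` and `secondary` are plain dicts with the hypothetical keys
    alb_dns_name, alb_hosted_zone_id, and health_check_id.
    """

    def record(role, target):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,  # both records share this DNS name
                "Type": "A",
                "SetIdentifier": f"chalk-{role.lower()}",
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "HealthCheckId": target["health_check_id"],
                "AliasTarget": {
                    "HostedZoneId": target["alb_hosted_zone_id"],
                    "DNSName": target["alb_dns_name"],
                    "EvaluateTargetHealth": True,
                },
            },
        }

    return {"Changes": [record("PRIMARY", primary), record("SECONDARY", secondary)]}


# Placeholder values for illustration only.
batch = failover_change_batch(
    "chalk-failover.example.com",
    primary={
        "alb_dns_name": "prod-east-alb.example.com",
        "alb_hosted_zone_id": "ZEXAMPLEEAST",
        "health_check_id": "hc-east-placeholder",
    },
    secondary={
        "alb_dns_name": "prod-west-alb.example.com",
        "alb_hosted_zone_id": "ZEXAMPLEWEST",
        "health_check_id": "hc-west-placeholder",
    },
)

# With boto3 installed and credentials configured, the batch could be applied with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```

Building the payload separately from the API call keeps the record definitions easy to review and diff before anything is changed in DNS.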
Route 53 health checks poll the `/healthz` endpoint on each ALB. The default health check interval is 30 seconds. You can configure the failure threshold (number of consecutive failures before failover) and request interval to tune failover speed vs. sensitivity.
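To make the speed vs. sensitivity trade-off concrete, here is a back-of-the-envelope estimate of how long clients might keep reaching an unhealthy region. It assumes detection takes roughly the request interval multiplied by the failure threshold, and that clients may cache the old DNS answer for one TTL. The function name and the 60-second TTL default are assumptions for illustration; actual timing depends on your health check and resolver behavior.

```python
def estimated_failover_seconds(request_interval_s=30, failure_threshold=3, dns_ttl_s=60):
    """Rough upper-bound estimate of time until clients reach the secondary region.

    detection time  ~= request_interval_s * failure_threshold
    DNS propagation ~= dns_ttl_s (assumed TTL of the cached answer)
    """
    detection_s = request_interval_s * failure_threshold
    return detection_s + dns_ttl_s


# 30-second interval with a threshold of 3, plus an assumed 60-second TTL.
standard = estimated_failover_seconds()
# A shorter interval and lower threshold fail over faster but react to brief blips.
aggressive = estimated_failover_seconds(request_interval_s=10, failure_threshold=2)
```

Lower values shorten the failover window at the cost of more spurious failovers during transient errors.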
Configure your application to send queries through the failover DNS name using the
`query_server` parameter on `ChalkClient`:
    from chalk.client import ChalkClient

    client = ChalkClient(
        query_server="https://chalk-failover.example.com",
    )

    result = client.query(
        input={"user.id": 123},
        output=["user.name", "user.risk_score"],
    )

The client does not need any awareness of which region is active. DNS resolution handles routing transparently.
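While DNS is switching between regions, some requests may still reach the unhealthy region and fail. One possible way to ride out that window is a small retry wrapper with exponential backoff around the query call. This is a generic sketch rather than a Chalk feature: `query_with_retry` and its parameters are hypothetical, and in practice you would likely catch only connection or timeout errors instead of every exception.

```python
import time


def query_with_retry(run_query, attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call run_query(), retrying with exponential backoff if it raises.

    run_query is any zero-argument callable, for example
    lambda: client.query(input=..., output=...).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return run_query()
        except Exception as exc:  # assumption: narrow this to transport errors in real code
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1s, 2s, 4s, ... between attempts.
                sleep(base_delay_s * 2 ** attempt)
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.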
Under normal operation, Route 53 resolves the failover hostname to the primary region’s ALB. All client traffic flows to the primary EKS cluster.
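When the primary region becomes unhealthy, its `/healthz` endpoint begins failing. Once the configured failure threshold is reached, Route 53 switches resolution of the failover hostname to the secondary ALB.

When the primary region recovers, its `/healthz` endpoint starts returning 200 again, and Route 53 resumes resolving the failover hostname to the primary ALB.

Each region’s Chalk deployment operates independently, but the underlying data stores must be replicated across regions so the passive region can serve traffic with current feature values.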
Chalk uses DynamoDB Global Tables to replicate online feature values across regions. Global Tables provide automatic, asynchronous replication so that the passive region’s online store stays up to date with writes from the active region. No application-level changes are required — both regions read and write to the same logical table.
Chalk uses Amazon MSK (Managed Streaming for Apache Kafka) to coordinate background persistence operations such as writing computed features to the online and offline stores. In a multi-region deployment, MSK must be configured for cross-region replication using MSK Replicator.
MSK Replicator continuously replicates topics from the active region’s MSK cluster to the passive region’s MSK cluster. This ensures that persistence operations initiated by the active region — such as online store writes and offline store ingestion — are also applied in the passive region. When failover occurs, the passive region’s persistence workers pick up from the replicated topic offsets.
The Chalk metadata plane uses Amazon Aurora as its sole state store. Because Aurora is the only stateful component, multi-region metadata plane availability maps directly to Aurora’s cross-region capabilities:
The choice between active-passive and active-active for the metadata plane should match the routing strategy you choose for the data plane (see Active-active vs. active-passive below).
Offline stores are typically regional. Training data queries and dataset generation should target a specific region rather than the failover hostname.
Run `chalk apply` against both regions to keep feature definitions and resolver code in
sync. You can automate this in CI/CD by deploying to each resource group.
RPO — the maximum acceptable amount of data loss measured in time — depends on the failure scenario:
| Failure scenario | RPO | Notes |
|---|---|---|
| Single AZ failure | 0 | EKS and DynamoDB span multiple AZs within a region. No data is lost. |
| Full region failure (recoverable) | Replication lag | Determined by DynamoDB Global Tables replication lag and MSK Replicator lag, typically seconds. Once the primary region recovers, any in-flight writes that had not yet replicated are recovered from the primary. |
| Full region failure (permanent loss) | Replication lag | If the primary region is permanently lost, any writes that had not yet replicated to the passive region are lost. RPO equals the replication lag at the time of failure. |
You can monitor DynamoDB Global Tables replication lag via the `ReplicationLatency` CloudWatch metric, and MSK Replicator lag via the `ReplicationLatency` metric on your replicator. Under normal conditions both are single-digit seconds.
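As a sketch of how the DynamoDB side of this monitoring could be scripted, the helpers below build the arguments for a CloudWatch `GetMetricStatistics` request on `ReplicationLatency` and convert the returned datapoints (reported in milliseconds) to seconds. The function names and the table name are hypothetical; substitute the table your deployment actually uses.

```python
from datetime import datetime, timedelta, timezone


def replication_latency_query(table_name, receiving_region, minutes=15, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ReplicationLatency",
        "Dimensions": [
            {"Name": "TableName", "Value": table_name},
            {"Name": "ReceivingRegion", "Value": receiving_region},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Maximum"],
    }


def worst_lag_seconds(datapoints):
    """Return the largest observed replication lag in seconds, or None if empty."""
    if not datapoints:
        return None
    return max(point["Maximum"] for point in datapoints) / 1000.0


# Hypothetical table name; the passive region receives the replicated writes.
params = replication_latency_query("chalk-online-store-example", "us-west-2")

# With boto3 installed and credentials configured, the lag could be fetched with:
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   lag = worst_lag_seconds(cw.get_metric_statistics(**params)["Datapoints"])
```

Comparing the resulting value against your RPO target gives a simple signal for alerting when replication falls behind.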
The architecture described above is active-passive: only one region serves live traffic at a time. Chalk also supports active-active configurations using Route 53 weighted or latency-based routing, where both regions serve traffic simultaneously. Contact the Chalk team to discuss which approach is best for your availability and latency requirements.