# Multi-Region Failover
source: https://docs.chalk.ai/docs/multi-region-failover

## Configure multi-region failover for high availability on AWS.

Chalk supports multi-region failover on AWS so that your feature serving infrastructure
remains available even if an entire region goes down. This guide covers how to set up
active-passive failover using multiple EKS clusters, resource groups, and Route 53
failover routing.

### Architecture overview

Multi-region failover works by deploying Chalk into EKS clusters in two or more AWS
regions. Each region runs an independent set of Chalk services behind its own ALB.
Route 53 failover routing directs traffic to the healthy region based on health checks
against each ALB.

The key components are:

- Multiple EKS clusters -- one per region, each running a full Chalk data plane.
- Resource groups -- each region's Chalk deployment is assigned to a distinct resource group with a unique hostname.
- Route 53 failover routing -- a single DNS name resolves to the healthy region's ALB, with automatic failover when health checks fail.

### Setting up multi-region failover

### 1. Create EKS clusters in each region

Work with the Chalk team to provision EKS clusters in your target regions (for example,
us-east-1 and us-west-2). Each cluster runs a complete Chalk data plane with its
own query servers, online store, and background persistence workers.

### 2. Configure resource groups

Each regional deployment is assigned to its own resource group.
Resource groups provide workload isolation and give each region a unique hostname for
its query servers.

For example, you might have:

| Region      | Resource group | Hostname                      |
| ----------- | -------------- | ----------------------------- |
| `us-east-1` | `prod-east`    | `prod-east.query.example.com` |
| `us-west-2` | `prod-west`    | `prod-west.query.example.com` |

Each resource group's query server exposes a `/healthz` endpoint that the ALB uses for
health checking. When the query server is healthy, `/healthz` returns a 200 response.
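
To confirm that each regional hostname is healthy before wiring up DNS, you can probe the endpoints yourself. The following is a minimal sketch using only the Python standard library; the helper names `fetch_status` and `region_health` are hypothetical, and the hostnames are the example values from the table above.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for `url`, or 0 if no response was received."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar errors.
        return 0


def region_health(
    hostnames: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, bool]:
    """Map each hostname to True when its /healthz endpoint returns 200."""
    return {host: fetch(f"https://{host}/healthz") == 200 for host in hostnames}
```

Calling `region_health(["prod-east.query.example.com", "prod-west.query.example.com"])` returns one boolean per hostname. The `fetch` parameter is injectable so the logic can be exercised without network access.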

### 3. Configure Route 53 failover routing

Create a Route 53 hosted zone with failover routing records that point to the ALB in
each region:

- Primary record -- points to the ALB in your primary region (e.g. us-east-1),
with a health check against the `/healthz` endpoint.
- Secondary record -- points to the ALB in your secondary region (e.g. us-west-2),
with its own health check.

Both records share the same DNS name (e.g. chalk-failover.example.com). Route 53
resolves this name to the primary ALB when it is healthy, and automatically switches to
the secondary ALB when the primary's health check fails.

Route 53 health checks poll the `/healthz` endpoint on each ALB. By default, Route 53
sends a request every 30 seconds and marks an endpoint unhealthy after three consecutive
failures. You can lower the request interval to 10 seconds or adjust the failure
threshold to trade failover speed against sensitivity to transient errors.
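
As an illustration, the primary and secondary records can be expressed as a Route 53 change batch. This is a sketch under stated assumptions, not Chalk's provisioning code: the helper names are hypothetical, the ALB DNS names and health check IDs are placeholders you would supply, and it uses `CNAME` records with an explicit TTL rather than alias records.

```python
from typing import Any, Dict


def failover_record(
    name: str, role: str, alb_dns_name: str, health_check_id: str, ttl: int = 60
) -> Dict[str, Any]:
    """Build one UPSERT change for a failover record (role is PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"role must be PRIMARY or SECONDARY, got {role!r}")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            # SetIdentifier distinguishes records that share the same name and type.
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,
            "TTL": ttl,
            "ResourceRecords": [{"Value": alb_dns_name}],
            "HealthCheckId": health_check_id,
        },
    }


def failover_change_batch(
    name: str,
    primary_alb: str,
    primary_health_check_id: str,
    secondary_alb: str,
    secondary_health_check_id: str,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build a change batch containing both failover records for one DNS name."""
    return {
        "Comment": "Active-passive failover records",
        "Changes": [
            failover_record(name, "PRIMARY", primary_alb, primary_health_check_id, ttl),
            failover_record(name, "SECONDARY", secondary_alb, secondary_health_check_id, ttl),
        ],
    }
```

The resulting dictionary has the shape expected by the `ChangeBatch` argument of boto3's `change_resource_record_sets` call on a Route 53 client, together with the `HostedZoneId` of your hosted zone.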

### 4. Point clients at the failover hostname

Configure your application to send queries through the failover DNS name using the
`query_server` parameter on `ChalkClient`:

```python
from chalk.client import ChalkClient

client = ChalkClient(
    query_server="https://chalk-failover.example.com",
)

result = client.query(
    input={"user.id": 123},
    output=["user.name", "user.risk_score"],
)
```

The client does not need any awareness of which region is active. DNS resolution
handles routing transparently.

### How failover works

Under normal operation, Route 53 resolves the failover hostname to the primary region's
ALB. All client traffic flows to the primary EKS cluster.

When the primary region becomes unhealthy:

- The ALB health check against `/healthz` begins failing.
- Route 53 detects consecutive health check failures (typically within 1-2 minutes).
- Route 53 updates DNS to resolve the failover hostname to the secondary region's ALB.
- Client traffic is routed to the secondary EKS cluster.

When the primary region recovers:

- The `/healthz` endpoint starts returning 200 again.
- Route 53 detects the recovery and switches DNS back to the primary region.

Failover is DNS-based, so clients must respect DNS TTLs. Set a low TTL (e.g. 60 seconds)
on the failover record to minimize the window during which clients may still resolve to
the unhealthy region.
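
A rough way to reason about the failover window is to add the health check detection time to the DNS TTL. The function below is a back-of-the-envelope sketch with a hypothetical name; it assumes clients honor the TTL exactly and ignores health checker propagation delay and long-lived client connections.

```python
def worst_case_failover_seconds(
    request_interval_s: int = 30,
    failure_threshold: int = 3,
    dns_ttl_s: int = 60,
) -> int:
    """Estimate the time from region failure until clients resolve the secondary.

    Detection takes roughly request_interval_s * failure_threshold seconds,
    after which cached DNS answers can persist for up to dns_ttl_s seconds.
    """
    detection_s = request_interval_s * failure_threshold
    return detection_s + dns_ttl_s


print(worst_case_failover_seconds())                       # 30 * 3 + 60 = 150
print(worst_case_failover_seconds(request_interval_s=10))  # 10 * 3 + 60 = 90
```

With a 30-second interval, a threshold of three, and a 60-second TTL, the estimate is 150 seconds. Dropping the interval to 10 seconds brings it to 90.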

### Data replication

Each region's Chalk deployment operates independently, but the underlying data stores
must be replicated across regions so the passive region can serve traffic with current
feature values.

### Online store -- DynamoDB Global Tables

Chalk uses DynamoDB Global Tables
to replicate online feature values across regions. Global Tables provide automatic,
asynchronous replication so that the passive region's online store stays up to date
with writes from the active region. No application-level changes are required -- both
regions read and write to the same logical table.

### Online store -- ElastiCache Valkey Global Datastore

If you use ElastiCache Valkey as your online
store instead of DynamoDB, cross-region replication is handled by
ElastiCache Global Datastore.

Global Datastore creates a fully managed, active-passive replication topology across
regions:

- Primary cluster -- the ElastiCache Valkey cluster in your primary region accepts
all writes. Chalk's persistence workers in the active region write computed feature
values here.
- Secondary cluster -- a read-only replica cluster in the secondary region receives
writes asynchronously from the primary. Replication is continuous, and cross-region
lag is typically under one second under normal network conditions.

When a region failover occurs, the secondary cluster must be promoted to become the
new primary before it can accept writes. ElastiCache supports managed promotion through
the AWS console or API, and the process typically completes within minutes.

### How async replication works

ElastiCache Global Datastore uses the Valkey replication stream to replicate data
cross-region:

- Writes land on the primary cluster's primary node as normal Valkey commands.
- The replication stream is forwarded over an AWS-managed, encrypted cross-region
link to the secondary cluster.
- The secondary cluster applies the replication stream to its own dataset, keeping it
eventually consistent with the primary.

Because replication is asynchronous, there is a small window where the secondary may be
behind the primary: typically milliseconds for in-region replicas and under one second
cross-region. In a failover scenario, any writes that had not yet been replicated to the
secondary at the moment of failure may be lost.

Monitor cross-region replication health with the `GlobalDatastoreReplicationLag`
CloudWatch metric on your Global Datastore. Under normal conditions this stays under one
second. In-region replica lag is reported separately by the `ReplicationLag` metric and
is typically far lower.

### Failover behavior

During a region failover:

- Route 53 detects the primary region's health check failure and switches DNS to the
secondary region.
- You (or an automated runbook) promote the secondary ElastiCache cluster to primary
using the
Global Datastore failover API.
- Chalk's query servers in the secondary region begin reading from the now-promoted
cluster. Persistence workers in the secondary region begin writing new feature values
to it.

Unlike DynamoDB Global Tables, ElastiCache Global Datastore requires an explicit
promotion step during failover. Until the secondary cluster is promoted, it remains
read-only. Plan for this in your failover runbook or automation.
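
The promotion step can be scripted as part of a runbook. The sketch below wraps the `failover_global_replication_group` operation that boto3's ElastiCache client exposes; the function name `promote_secondary` and all identifiers passed to it are hypothetical, and the client is passed in so the logic can be tested without AWS credentials.

```python
from typing import Any, Dict


def promote_secondary(
    elasticache_client: Any,
    global_replication_group_id: str,
    secondary_region: str,
    secondary_replication_group_id: str,
) -> Dict[str, Any]:
    """Promote the secondary region's cluster to primary in a Global Datastore.

    `elasticache_client` is expected to behave like boto3.client("elasticache").
    """
    return elasticache_client.failover_global_replication_group(
        GlobalReplicationGroupId=global_replication_group_id,
        # The region and replication group that should become the new primary.
        PrimaryRegion=secondary_region,
        PrimaryReplicationGroupId=secondary_replication_group_id,
    )
```

In a real runbook you would pass an ElastiCache client created with boto3, then wait for the global replication group to report an available status before directing persistence workers to write to the promoted cluster.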

### Configuration

Work with the Chalk team to provision ElastiCache Global Datastore across your target
regions. Each region's Chalk deployment should be configured with a connection URI
pointing to the local ElastiCache cluster endpoint, so reads are always served from
the nearest region. See the online store setup guide
for connection URI format details.

### Persistence coordination -- MSK Replicator

Chalk uses Amazon MSK (Managed Streaming for Apache Kafka)
to coordinate background persistence operations such as writing computed features to the
online and offline stores. In a multi-region deployment, MSK must be configured for
cross-region replication using
MSK Replicator.

MSK Replicator continuously replicates topics from the active region's MSK cluster to
the passive region's MSK cluster. This ensures that persistence operations initiated by
the active region -- such as online store writes and offline store ingestion -- are also
applied in the passive region. When failover occurs, the passive region's persistence
workers pick up from the replicated topic offsets.

### Metadata plane -- RDS Aurora

The Chalk metadata plane uses
Amazon Aurora as its sole state store. Because
Aurora is the only stateful component, multi-region metadata plane availability maps
directly to Aurora's cross-region capabilities:

- Active-passive -- use an
Aurora Global Database
with a read replica in the secondary region. Aurora handles asynchronous replication
and supports managed failover to promote the secondary cluster when the primary
region is unavailable.
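- Active-active -- use
Aurora Global Database write forwarding
so that both regions can accept writes. Writes issued in the secondary region are
forwarded to the primary cluster, so write availability still depends on the primary
region. Aurora multi-master clusters are limited to a single region and do not apply here.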

The choice between active-passive and active-active for the metadata plane should match
the routing strategy you choose for the data plane (see
Active-active vs. active-passive below).

### Offline store

Offline stores are typically regional. Training data queries and dataset generation should
target a specific region rather than the failover hostname.

### Deployments

Run `chalk apply` against both regions so that feature definitions and resolver code stay
in sync. You can automate this in CI/CD by deploying to each resource group in turn.

### Recovery point objective (RPO)

RPO -- the maximum acceptable amount of data loss measured in time -- depends on the
failure scenario:

| Failure scenario                     | RPO                 | Notes                                                                                                                                                                                                                                                                    |
| ------------------------------------ | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Single AZ failure                    | **0**               | EKS, DynamoDB, and ElastiCache span multiple AZs within a region. No data is lost.                                                                                                                                                                                       |
| Full region failure (recoverable)    | **Replication lag** | Determined by your online store's replication lag (DynamoDB Global Tables or ElastiCache Global Datastore) and MSK Replicator lag, typically seconds. Once the primary region recovers, any in-flight writes that had not yet replicated are recovered from the primary. |
| Full region failure (permanent loss) | **Replication lag** | If the primary region is permanently lost, any writes that had not yet replicated to the passive region are lost. RPO equals the replication lag at the time of failure.                                                                                                 |

You can monitor replication lag via CloudWatch: `ReplicationLatency` for DynamoDB
Global Tables, `GlobalDatastoreReplicationLag` for ElastiCache Global Datastore, and
`ReplicationLatency` for MSK Replicator. Under normal conditions all are single-digit
seconds or less.
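
To relate observed lag to an RPO target, one simple approach is to treat the worst recent lag across all replicated components as the current RPO estimate. The sketch below assumes you have already fetched lag samples in seconds from CloudWatch; the function names and component labels are hypothetical.

```python
from typing import Dict, List, Sequence


def estimated_rpo_seconds(lag_samples_s: Dict[str, Sequence[float]]) -> float:
    """Return the worst observed replication lag across all components."""
    return max(
        (max(samples) for samples in lag_samples_s.values() if samples),
        default=0.0,
    )


def components_over_budget(
    lag_samples_s: Dict[str, Sequence[float]], rpo_budget_s: float
) -> List[str]:
    """List the components whose worst observed lag exceeds the RPO budget."""
    return sorted(
        name
        for name, samples in lag_samples_s.items()
        if samples and max(samples) > rpo_budget_s
    )
```

For example, with samples of `[0.4, 0.9]` for the online store and `[1.5, 3.0]` for MSK Replicator, the estimate is 3.0 seconds, and only the MSK Replicator exceeds a 2-second budget.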

### Active-active vs. active-passive

The architecture described above is active-passive: only one region serves live
traffic at a time. Chalk also supports active-active configurations using Route 53
weighted or latency-based routing, where both regions serve traffic simultaneously.
Contact the Chalk team to discuss which approach is best for your availability and
latency requirements.
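
For comparison with the failover records used in the active-passive setup, a weighted routing record might look like the following. This is an illustrative sketch only: the helper name, ALB DNS names, and weights are assumptions, and the right configuration for your deployment should be agreed with the Chalk team.

```python
from typing import Any, Dict, Optional


def weighted_record(
    name: str,
    region: str,
    alb_dns_name: str,
    weight: int,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build one UPSERT change for a Route 53 weighted record."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": f"{name}-{region}",
        "Weight": weight,
        "TTL": ttl,
        "ResourceRecords": [{"Value": alb_dns_name}],
    }
    if health_check_id is not None:
        # With a health check attached, an unhealthy region stops receiving traffic.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}
```

Two records with equal weights split traffic roughly evenly between regions, while unequal weights shift the proportion toward one region.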