Overview

Chalk supports DynamoDB as an online store. Online query results and cached feature values are written to DynamoDB by the background persistence writers and read directly by the query servers. This page covers how to size DynamoDB capacity, how to choose between single-region and multi-region deployments, and how to provision everything via Terraform.

Chalk’s DynamoDB online store uses a single table per environment with feature keys encoded to minimize both storage and capacity consumption: values are stored using native DynamoDB data types (not JSON-encoded strings), and feature names are compressed to short stable identifiers. In practice, this means a workload consumes noticeably fewer WCUs/RCUs than a naive estimate based on the raw JSON size of a feature set would suggest.


DynamoDB vs. Valkey/Redis

Chalk supports both DynamoDB and Valkey (or Redis) as online stores. The right choice depends on your workload:

  • DynamoDB is the better fit when you have a large working set of feature values with modest per-query storage requirements. Because DynamoDB is a managed disk-backed store, it can cost-effectively hold billions of entities without the memory pressure that would dominate a Valkey deployment. It also requires no capacity planning for replication or failover beyond the WCU/RCU dimensions.
  • Valkey/Redis is the better fit for ultra-low-latency workloads and for workloads where the working set is small enough to fit in memory. In-memory reads are meaningfully faster than DynamoDB’s single-digit-millisecond reads, and small Valkey deployments are typically cheaper than an equivalent DynamoDB configuration.

A common pattern is DynamoDB fronted by an in-process LRU cache and/or a Bloom filter, which short-circuits reads that would otherwise go to DynamoDB.
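As a concrete illustration of this pattern, here is a minimal Python sketch (not Chalk’s implementation; the fetch function is a stand-in for a DynamoDB GetItem call): an in-process LRU cache fronted by a Bloom filter of keys that have been written, so reads for never-written keys skip the DynamoDB round trip entirely.

```python
import hashlib
from collections import OrderedDict


class BloomFilter:
    """Probabilistic set: no false negatives, tunably rare false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class CachedOnlineStore:
    def __init__(self, fetch_fn, capacity=10_000):
        self.fetch_fn = fetch_fn   # stand-in for a DynamoDB GetItem wrapper
        self.bloom = BloomFilter()
        self.lru = OrderedDict()
        self.capacity = capacity

    def record_write(self, key: str):
        # Called on the write path so the read path knows the key may exist.
        self.bloom.add(key)

    def get(self, key: str):
        if not self.bloom.might_contain(key):
            return None            # definitely never written: no DynamoDB read
        if key in self.lru:
            self.lru.move_to_end(key)
            return self.lru[key]   # cache hit: no DynamoDB read
        value = self.fetch_fn(key)  # cache miss: exactly one DynamoDB read
        self.lru[key] = value
        if len(self.lru) > self.capacity:
            self.lru.popitem(last=False)
        return value
```

The Bloom filter never produces false negatives, so a key that was written is always fetched; false positives only cost one extra read, so the filter is safe to use as a read-suppression optimization.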


Sizing WCU and RCU

Chalk’s DynamoDB encoding (native dtypes + short feature identifiers) keeps per-item payloads small, so RCU/WCU calculations are typically driven by query volume and the number of features read per query rather than by raw payload size.

A useful starting point:

  • RCU — each online query consumes roughly one RCU per entity read: entities typically fit within the 4 KB read unit even with dozens of features, and an eventually consistent read of such an item actually consumes only 0.5 RCU, so one RCU per entity is a conservative bound. Multiply expected QPS by the average number of entities loaded per query. Use eventually consistent reads unless you have a specific reason to pay 2x for strongly consistent reads; Chalk does not require strong consistency.
  • WCU — each persisted query result consumes roughly one WCU per entity written (again, entities typically fit within the 1 KB write unit). Multiply expected QPS by the fraction of queries whose results are persisted to the online store and by the average number of entities written per query.
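The two rules above reduce to simple arithmetic. A sketch in Python, with hypothetical traffic numbers and a headroom factor you should choose for your own workload:

```python
def estimate_capacity(qps, entities_read_per_query, persist_fraction,
                      entities_written_per_query, headroom=1.5):
    """Back-of-envelope WCU/RCU estimate at ~1 unit per entity."""
    rcu = qps * entities_read_per_query                        # ~1 RCU per entity read
    wcu = qps * persist_fraction * entities_written_per_query  # ~1 WCU per entity written
    return {"rcu": rcu * headroom, "wcu": wcu * headroom}


# e.g. 2,000 QPS, 3 entities read per query, 40% of results persisted,
# 2 entities written per persisted query, 1.5x headroom:
estimate_capacity(2000, 3, 0.4, 2)
# -> {'rcu': 9000.0, 'wcu': 2400.0}
```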

Chalk will assist with initial sizing based on your query mix, but the customer is ultimately responsible for choosing and tuning DynamoDB capacity: DynamoDB capacity is a direct cost driver, and the tradeoffs between provisioned, on-demand, and autoscaled capacity are workload-specific and owned by the customer.

Provisioned vs. on-demand vs. autoscaled

DynamoDB offers three capacity modes, each with different cost and operational characteristics:

  • Provisioned (static) — you pay for a fixed WCU/RCU amount regardless of utilization. Cheapest per unit for steady-state workloads where utilization is consistently high, but throttles immediately when traffic exceeds the provisioned level. Appropriate when you have a well-understood, relatively flat traffic pattern.
  • Provisioned with autoscaling — capacity tracks a target utilization (typically 70%). AWS Application Auto Scaling adjusts WCU/RCU in response to CloudWatch metrics. Scale-up is reactive (there is a lag, typically minutes) and scale-down has a cooldown, so autoscaling accommodates gradual traffic shifts well but can still throttle on sharp spikes.
  • On-demand — no capacity planning; you pay per request. Roughly 7x the per-unit cost of fully utilized provisioned capacity, but absorbs most traffic spikes without intervention (on-demand tables can still throttle briefly when traffic jumps far above the recent peak). Appropriate for bursty, unpredictable traffic where sustained throttling is not acceptable.

For most production Chalk deployments, provisioned-with-autoscaling is the right default: it amortizes the steady-state cost advantage of provisioned capacity while still absorbing diurnal traffic variation. Reserve on-demand for environments with highly unpredictable traffic or very low steady-state utilization.
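One way to see the break-even: at the rough 7x per-unit ratio above, provisioned capacity billed at average utilization u costs 1/u per consumed unit, so provisioned wins whenever u exceeds 1/7 ≈ 14%. A sketch (the exact ratio varies by region and over time; check current AWS pricing):

```python
def breakeven_utilization(on_demand_ratio=7.0):
    """Average utilization above which provisioned is cheaper than on-demand.

    Provisioned capacity at utilization u costs 1/u per consumed unit
    relative to a fully utilized unit; on-demand costs `on_demand_ratio`.
    Provisioned wins when 1/u < on_demand_ratio, i.e. u > 1/on_demand_ratio.
    """
    return 1.0 / on_demand_ratio


breakeven_utilization()  # -> ~0.14: above ~14% average utilization,
                         # provisioned capacity is the cheaper mode
```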


Single-region vs. multi-region

Chalk supports DynamoDB online stores in either a single region or replicated across multiple regions using DynamoDB Global Tables.

Single-region

A single-region deployment is the simplest configuration: one table in one region, accessed by Chalk query servers running in the same region. If the region becomes unavailable, the online store is unavailable and online queries will fail until the region recovers. Single-region is appropriate when your application’s availability requirements do not extend beyond a single AWS region.

Multi-region

Global Tables replicate items asynchronously between regions, typically with sub-second propagation under normal conditions. Chalk recommends asynchronous replication (the default for Global Tables) rather than attempting to build strongly consistent cross-region writes: synchronous cross-region replication would require every online write to commit in at least two regions before returning, which is prohibitively expensive in both latency (adding one inter-region round trip per write) and cost.

Because replication is asynchronous, a regional failover can lose the last few seconds of writes that had not yet replicated from the lost primary region. In practice, this achieves RPO < 1 minute: the write lag for Global Tables is typically under 1 second during normal operation, and even under regional stress has historically stayed well below a minute. For Chalk online queries, the practical effect of this RPO is that a small number of the most recently persisted query results may be missing after failover, forcing re-computation on the next query; feature values themselves are not corrupted.

The tradeoff: accept a small RPO in exchange for (a) much lower write latency, (b) lower cost, and (c) a simpler operational model. Applications that cannot tolerate any lost writes must use a different persistence model than an online feature store.

See Multi-Region Failover for the Chalk-level configuration that steers query traffic to a healthy region.


Shared responsibility

Chalk will assist with DynamoDB sizing, capacity-mode selection, and replication topology, but the customer is ultimately responsible for provisioning and operating the DynamoDB table. This is intentional: DynamoDB capacity is a direct cost driver that the customer controls, and capacity decisions must be made against the customer’s own cost model and availability targets.

Chalk’s responsibilities are:

  • Advising on initial sizing and recommending capacity mode for a given workload
  • Operating the Chalk components that read from and write to DynamoDB
  • Reporting online store error rates, throttle rates, and latency in the Chalk UI

Customer responsibilities are:

  • Provisioning the DynamoDB table and any Global Table replicas
  • Choosing provisioned / on-demand / autoscaled capacity
  • Tuning autoscaling targets and floor/ceiling values
  • Configuring IAM access for the Chalk service account

Example Terraform: single-region

A single-region DynamoDB table with provisioned capacity:

resource "aws_dynamodb_table" "chalk_online_store" {
  name         = "chalk-online-store"
  billing_mode = "PROVISIONED"

  read_capacity  = 1000
  write_capacity = 500

  hash_key  = "pk"
  range_key = "sk"

  attribute {
    name = "pk"
    type = "S"
  }

  attribute {
    name = "sk"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = {
    chalk_environment = "production"
  }
}

Switch billing_mode to PAY_PER_REQUEST and remove the read_capacity / write_capacity fields for on-demand.


Example Terraform: multi-region (Global Tables)

A multi-region deployment uses a single aws_dynamodb_table resource with replica blocks. Global Tables require stream_enabled = true and stream_view_type = "NEW_AND_OLD_IMAGES":

resource "aws_dynamodb_table" "chalk_online_store" {
  name             = "chalk-online-store"
  billing_mode     = "PROVISIONED"
  read_capacity    = 1000
  write_capacity   = 500
  hash_key         = "pk"
  range_key        = "sk"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "pk"
    type = "S"
  }

  attribute {
    name = "sk"
    type = "S"
  }

  replica {
    region_name = "us-east-1"
  }

  replica {
    region_name = "us-west-2"
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = {
    chalk_environment = "production"
  }
}

Each replica is a full copy of the table in the specified region; replication is asynchronous with typical lag well under a second. Provisioned capacity applies per-region: size read capacity for each region’s local read traffic, but size write capacity in every region for the global write rate, since each write is replicated to all replicas.
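Because every write is replicated to every replica region, each region’s write capacity must cover the sum of all regions’ write rates, while read capacity only needs to cover local reads. A hypothetical per-region sizing sketch (inputs are entity reads/writes per second per region):

```python
def per_region_capacity(read_qps_by_region, write_qps_by_region, headroom=1.5):
    """Per-region WCU/RCU for a Global Table.

    Reads are served locally, so RCU is sized per region. Writes replicate
    to every replica, so every region's WCU must cover the global write rate.
    """
    global_writes = sum(write_qps_by_region.values())
    return {
        region: {"rcu": reads * headroom, "wcu": global_writes * headroom}
        for region, reads in read_qps_by_region.items()
    }


per_region_capacity({"us-east-1": 800, "us-west-2": 200},
                    {"us-east-1": 300, "us-west-2": 100})
# -> every region needs WCU for all 400 writes/s,
#    but RCU only for its own local reads
```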


Example Terraform: autoscaling policy

An autoscaling policy tracks target utilization on read and write capacity. Attach one scalable target and one policy per capacity dimension (read and write):

resource "aws_appautoscaling_target" "read_target" {
  max_capacity       = 5000
  min_capacity       = 500
  resource_id        = "table/${aws_dynamodb_table.chalk_online_store.name}"
  scalable_dimension = "dynamodb:table:ReadCapacityUnits"
  service_namespace  = "dynamodb"
}

resource "aws_appautoscaling_policy" "read_policy" {
  name               = "chalk-online-store-read-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.read_target.resource_id
  scalable_dimension = aws_appautoscaling_target.read_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.read_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBReadCapacityUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 60
    scale_out_cooldown = 60
  }
}

resource "aws_appautoscaling_target" "write_target" {
  max_capacity       = 2500
  min_capacity       = 250
  resource_id        = "table/${aws_dynamodb_table.chalk_online_store.name}"
  scalable_dimension = "dynamodb:table:WriteCapacityUnits"
  service_namespace  = "dynamodb"
}

resource "aws_appautoscaling_policy" "write_policy" {
  name               = "chalk-online-store-write-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.write_target.resource_id
  scalable_dimension = aws_appautoscaling_target.write_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.write_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBWriteCapacityUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 60
    scale_out_cooldown = 60
  }
}

A 70% target utilization is a conservative starting point that leaves headroom for the reactive scale-up delay. For workloads with sharper traffic spikes, lower the target to 50-60% or raise min_capacity so that the floor already covers expected peak-to-trough variation. For Global Tables, configure autoscaling independently in each region.
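The guidance above can be turned into a rough parameter-picking rule: choose a target utilization low enough that the largest spike you expect within the scale-up lag still fits under provisioned capacity, and a floor that covers trough traffic at that target. A hypothetical helper (inputs are consumed capacity units per second; the 70% cap mirrors the default target above):

```python
def autoscaling_params(trough_consumption, spike_factor):
    """Rough autoscaling target and floor for one capacity dimension.

    While autoscaling reacts, provisioned capacity sits near
    consumption / target, so a spike of `spike_factor` times current
    consumption fits only if target <= 1 / spike_factor.
    """
    target = min(0.7, 1.0 / spike_factor)
    return {
        "target_value": round(target * 100),          # percent, for Terraform
        "min_capacity": int(trough_consumption / target) + 1,
    }


# e.g. trough consumption of 1,000 RCU/s with spikes up to 2x:
autoscaling_params(1000, 2.0)
# -> {'target_value': 50, 'min_capacity': 2001}
```

This is a starting heuristic, not a substitute for observing throttle rates under real traffic; tune the floor and target against CloudWatch metrics after launch.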