Infrastructure
Deploy DynamoDB as a Chalk online store in a single region or with multi-region replication.
Chalk supports DynamoDB as an online store. Online query results and cached feature values are written to DynamoDB by the background persistence writers and read directly by the query servers. This page covers how to size DynamoDB capacity, how to choose between single-region and multi-region deployments, and how to provision everything via Terraform.
Chalk’s DynamoDB online store uses a single table per environment with feature keys encoded to minimize both storage and capacity consumption: values are stored using native DynamoDB data types (not JSON-encoded strings), and feature names are compressed to short stable identifiers. In practice, this means the table consumes noticeably fewer WCU/RCU than a naive estimate based on the raw JSON size of a feature set would suggest.
Chalk supports both DynamoDB and Valkey (or Redis) as online stores. The right choice depends on your workload:
A common pattern is DynamoDB with an LRU cache and/or a Bloom filter in front of it to minimize DynamoDB reads.
Chalk’s DynamoDB encoding (native dtypes + short feature identifiers) keeps per-item payloads small, so RCU/WCU calculations are typically driven by query volume and the number of features read per query rather than by raw payload size.
A useful starting point is to derive RCU from query volume multiplied by the number of items read per query, and WCU from the persistence writers' sustained write volume.
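As an illustrative back-of-envelope sketch (the workload numbers below are assumptions about a hypothetical deployment, not Chalk guidance), capacity can be estimated from DynamoDB's pricing units: one RCU covers a strongly consistent read of up to 4 KB per second (eventually consistent reads cost half), and one WCU covers a write of up to 1 KB per second:

```python
import math

def estimate_capacity(qps: float,
                      items_read_per_query: float,
                      item_size_bytes: int,
                      writes_per_sec: float,
                      eventually_consistent: bool = True) -> dict:
    """Back-of-envelope DynamoDB capacity estimate.

    RCU: 1 strongly consistent read/sec of up to 4 KB; eventually
    consistent reads cost half. WCU: 1 write/sec of up to 1 KB.
    Sizes round up to the 4 KB / 1 KB boundary per item.
    """
    read_units_per_item = math.ceil(item_size_bytes / 4096)
    if eventually_consistent:
        read_units_per_item /= 2
    rcu = qps * items_read_per_query * read_units_per_item
    wcu = writes_per_sec * math.ceil(item_size_bytes / 1024)
    return {"rcu": rcu, "wcu": wcu}

# Hypothetical workload: 2,000 QPS, 3 items read per query, ~600-byte
# items (Chalk's compact encoding keeps items small), 500 writes/sec.
print(estimate_capacity(2000, 3, 600, 500))
# → {'rcu': 3000.0, 'wcu': 500}
```

Note how the compact item encoding keeps each item under the 4 KB read boundary, so the estimate is driven by request counts rather than payload size, as described above.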
Chalk will assist with initial sizing based on your query mix, but the customer is ultimately responsible for choosing and tuning DynamoDB capacity: capacity is a direct cost driver, and the tradeoffs between provisioned, on-demand, and autoscaled capacity are workload-specific and owned by the customer.
DynamoDB offers three capacity modes (on-demand, provisioned, and provisioned with autoscaling), each with different cost and operational characteristics:
For most production Chalk deployments, provisioned-with-autoscaling is the right default: it retains the steady-state cost advantage of provisioned capacity while still absorbing diurnal traffic variation. Reserve on-demand for environments with highly unpredictable traffic or very low steady-state utilization.
Chalk supports DynamoDB online stores in either a single region or replicated across multiple regions using DynamoDB Global Tables.
A single-region deployment is the simplest configuration: one table in one region, accessed by Chalk query servers running in the same region. If the region becomes unavailable, the online store is unavailable and online queries will fail until the region recovers. Single-region is appropriate when your application’s availability requirements do not extend beyond a single AWS region.
Global Tables replicate items asynchronously between regions, typically with sub-second propagation under normal conditions. Chalk recommends asynchronous replication (the default for Global Tables) rather than attempting to build strongly consistent cross-region writes: synchronous cross-region replication would require every online write to commit in at least two regions before returning, which is prohibitively expensive in both latency (adding one inter-region round trip per write) and cost.
Because replication is asynchronous, a regional failover can lose the last few seconds of writes that had not yet replicated from the lost primary region. In practice, this achieves RPO < 1 minute: the write lag for Global Tables is typically under 1 second during normal operation, and even under regional stress has historically stayed well below a minute. For Chalk online queries, the practical effect of this RPO is that a small number of the most recently persisted query results may be missing after failover, forcing re-computation on the next query; feature values themselves are not corrupted.
The tradeoff: accept a small RPO in exchange for (a) much lower write latency, (b) lower cost, and (c) a simpler operational model. Applications that cannot tolerate any lost writes must use a different persistence model than an online feature store.
See Multi-Region Failover for the Chalk-level configuration that steers query traffic to a healthy region.
Chalk will assist with DynamoDB sizing, capacity-mode selection, and replication topology, but the customer is ultimately responsible for provisioning and operating the DynamoDB table. This is intentional: DynamoDB capacity is a direct cost driver that the customer controls, and capacity decisions must be made against the customer’s own cost model and availability targets.
Chalk’s responsibilities are advising: initial sizing based on the query mix, capacity-mode selection, and replication topology.

Customer responsibilities are provisioning and operating: creating the table (via Terraform, below), tuning capacity against the customer’s own cost model and availability targets, and owning the resulting spend.
A single-region DynamoDB table with provisioned capacity:
```hcl
resource "aws_dynamodb_table" "chalk_online_store" {
  name           = "chalk-online-store"
  billing_mode   = "PROVISIONED"
  read_capacity  = 1000
  write_capacity = 500
  hash_key       = "pk"
  range_key      = "sk"

  attribute {
    name = "pk"
    type = "S"
  }

  attribute {
    name = "sk"
    type = "S"
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = {
    chalk_environment = "production"
  }
}
```

Switch `billing_mode` to `PAY_PER_REQUEST` and remove the `read_capacity` / `write_capacity` fields for on-demand.
A multi-region deployment uses a single `aws_dynamodb_table` resource with `replica` blocks. Global Tables require `stream_enabled = true` and `stream_view_type = "NEW_AND_OLD_IMAGES"`:
```hcl
resource "aws_dynamodb_table" "chalk_online_store" {
  name             = "chalk-online-store"
  billing_mode     = "PROVISIONED"
  read_capacity    = 1000
  write_capacity   = 500
  hash_key         = "pk"
  range_key        = "sk"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "pk"
    type = "S"
  }

  attribute {
    name = "sk"
    type = "S"
  }

  replica {
    region_name = "us-east-1"
  }

  replica {
    region_name = "us-west-2"
  }

  point_in_time_recovery {
    enabled = true
  }

  server_side_encryption {
    enabled = true
  }

  tags = {
    chalk_environment = "production"
  }
}
```

Each replica is a full copy of the table in the specified region; replication is asynchronous with typical lag well under a second. Provisioned capacity applies per-region and must be sized for each region’s local traffic.
An autoscaling policy tracks target utilization on read and write capacity. Attach one scalable target and one policy per capacity dimension:
```hcl
resource "aws_appautoscaling_target" "read_target" {
  max_capacity       = 5000
  min_capacity       = 500
  resource_id        = "table/${aws_dynamodb_table.chalk_online_store.name}"
  scalable_dimension = "dynamodb:table:ReadCapacityUnits"
  service_namespace  = "dynamodb"
}

resource "aws_appautoscaling_policy" "read_policy" {
  name               = "chalk-online-store-read-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.read_target.resource_id
  scalable_dimension = aws_appautoscaling_target.read_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.read_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBReadCapacityUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 60
    scale_out_cooldown = 60
  }
}

resource "aws_appautoscaling_target" "write_target" {
  max_capacity       = 2500
  min_capacity       = 250
  resource_id        = "table/${aws_dynamodb_table.chalk_online_store.name}"
  scalable_dimension = "dynamodb:table:WriteCapacityUnits"
  service_namespace  = "dynamodb"
}

resource "aws_appautoscaling_policy" "write_policy" {
  name               = "chalk-online-store-write-autoscaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.write_target.resource_id
  scalable_dimension = aws_appautoscaling_target.write_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.write_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBWriteCapacityUtilization"
    }
    target_value       = 70.0
    scale_in_cooldown  = 60
    scale_out_cooldown = 60
  }
}
```

A 70% target utilization is a conservative starting point that leaves headroom for the reactive scale-up delay. For workloads with sharper traffic spikes, lower the target to 50-60% or raise `min_capacity` so that the floor already covers expected peak-to-trough variation.
For Global Tables, configure autoscaling independently in each region.
The Chalk-side DynamoDB client exposes a number of tuning knobs as environment variables. These are set on the Chalk engine and persistence-writer deployments and control client concurrency, batching, caching, retries, and consistency behavior. Most defaults are tuned for typical online-serving workloads; the settings below are documented primarily so that operators can diagnose throughput problems and adjust where the defaults do not match a specific workload.
DynamoDB requests are issued by a pool of client threads against a fixed pool of HTTP connections. Serialization and deserialization of items happens on a separate pool of serde threads so that CPU-bound encoding work does not block the I/O threads.
| Name | Default | Description |
|---|---|---|
| `DYNAMODB_NUM_CLIENT_THREADS` | `2 * desired_cpu_parallelism` | Number of threads in the DynamoDB client pool. These threads issue and await `BatchGetItem` / `BatchWriteItem` / `TransactWriteItems` calls. Increase for read-heavy workloads that bottleneck on I/O wait. |
| `DYNAMODB_NUM_CLIENT_CONNECTIONS` | `2 * desired_cpu_parallelism` | Maximum number of concurrent HTTP connections to DynamoDB. Should generally be set equal to or slightly above `DYNAMODB_NUM_CLIENT_THREADS`. Each connection corresponds to a TCP/TLS session. |
| `DYNAMODB_NUM_SERDE_THREADS` | `desired_cpu_parallelism` | Number of threads used to encode/decode DynamoDB items. CPU-bound; increase if profiling shows serde saturation while client threads are idle. |
BatchGetItem requests are split into multiple parallel sub-batches. The configuration
below controls how those sub-batches are sized.
| Name | Default | Description |
|---|---|---|
| `DYNAMODB_GETITEM_MIN_BATCH_SIZE` | 10 | Minimum number of keys per `BatchGetItem` sub-batch. The DynamoDB protocol allows up to 100 keys per batch; empirically, batches of fewer than ~10 keys are no faster than a 10-key batch, so smaller splits only add request overhead. |
| `DYNAMODB_GETITEM_MIN_BATCH_CONCURRENCY` | `DYNAMODB_NUM_CLIENT_THREADS` | Maximum number of parallel sub-batches per `BatchGetItem` request. Defaults to the size of the client thread pool. Lowering this is useful when an environment receives many concurrent queries: each individual query is then satisfied with fewer (larger) batches, leaving more client threads available to other queries. |
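The interaction between these two settings can be sketched as a simplified splitting model (this is an illustration of the described behavior, not Chalk's actual implementation):

```python
import math

def plan_sub_batches(num_keys: int,
                     min_batch_size: int = 10,    # DYNAMODB_GETITEM_MIN_BATCH_SIZE
                     max_concurrency: int = 16,   # DYNAMODB_GETITEM_MIN_BATCH_CONCURRENCY
                     protocol_max: int = 100) -> list[int]:
    """Split a BatchGetItem key set into parallel sub-batch sizes.

    Uses as many parallel batches as allowed, but never shrinks a batch
    below min_batch_size and never exceeds the 100-keys-per-call limit.
    """
    if num_keys == 0:
        return []
    # Concurrency is capped both by the configured maximum and by the
    # number of min-sized batches the keys can fill.
    n_batches = min(max_concurrency, math.ceil(num_keys / min_batch_size))
    # Each batch must also respect the 100-key protocol limit.
    n_batches = max(n_batches, math.ceil(num_keys / protocol_max))
    base, extra = divmod(num_keys, n_batches)
    return [base + (1 if i < extra else 0) for i in range(n_batches)]

print(plan_sub_batches(45))                     # → [9, 9, 9, 9, 9]
print(plan_sub_batches(45, max_concurrency=2))  # → [23, 22]
```

Lowering the concurrency cap from 16 to 2 in the second call yields fewer, larger batches for the same key set, which is the effect described in the table above.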
Chalk supports per-namespace LRU caching of feature values in front of DynamoDB, and a Bloom filter to short-circuit reads for keys known to be absent. These reduce DynamoDB RCU consumption and tail latency at the cost of memory.
| Name | Default | Description |
|---|---|---|
| `DYNAMODB_CACHED_NAMESPACES` | None | JSON list of namespace cache configurations of the form `{"namespace": "...", "ttl_seconds": 86400, "max_lru_size": 10000}`. `max_lru_size` is optional; if omitted, the cache grows without bound. Use for hot namespaces where stale-by-up-to-`ttl_seconds` reads are acceptable. |
| `DYNAMODB_LRU_CACHE_CACHE_MISSES` | true | When true, the namespace LRU cache also caches negative results (rows that did not exist in DynamoDB). Set to false to re-query on every miss; useful when missing rows are expected to be created by an out-of-band writer that the engine should observe quickly. |
| `DYNAMODB_BLOOM_FILTER_DEBUG_MODE` | false | When true, the Bloom filter still issues the underlying DynamoDB read on a Bloom hit/miss and verifies that the Bloom filter’s prediction was consistent with the actual store. Use only for debugging false-positive/negative rates; this disables the latency benefit of the filter. |
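For example, a `DYNAMODB_CACHED_NAMESPACES` value following the schema above can be built and validated like this (the namespace names are hypothetical):

```python
import json
import os

# Hypothetical namespaces: "user" gets a bounded LRU; "merchant" omits
# max_lru_size, so its cache would grow without bound.
cached_namespaces = [
    {"namespace": "user", "ttl_seconds": 86400, "max_lru_size": 10000},
    {"namespace": "merchant", "ttl_seconds": 3600},
]
os.environ["DYNAMODB_CACHED_NAMESPACES"] = json.dumps(cached_namespaces)

# Round-trip to confirm the value parses back as valid JSON.
parsed = json.loads(os.environ["DYNAMODB_CACHED_NAMESPACES"])
print(parsed[0]["namespace"])  # → user
```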
When request racing is enabled, slow BatchGetItem calls are duplicated after a configured
wait. The first response wins. This trades extra RCU consumption for better p99 read latency
when DynamoDB occasionally serves a request slowly.
Request racing is one of the most effective knobs available for cutting DynamoDB tail
latency. We recommend setting DYNAMODB_REQUEST_RACING_WAIT_TIME to roughly the p95 of
observed DynamoDB request latency: at that threshold, only the slowest ~5% of requests are
duplicated, so the additional RCU cost is small while the p99/p99.9 read tail collapses
toward p95. Setting the wait time meaningfully below p95 amplifies RCU consumption without
much further tail benefit; setting it above p95 leaves significant tail latency on the table.
| Name | Default | Description |
|---|---|---|
| `DYNAMODB_ENABLE_REQUEST_RACING` | false | Master switch for request racing. When true, `DYNAMODB_REQUEST_RACING_WAIT_TIME` must also be set. |
| `DYNAMODB_REQUEST_RACING_WAIT_TIME` | - | Wait time in milliseconds before issuing a duplicate request. Recommended value: roughly the p95 of DynamoDB request latency for this environment. Lower values cut tail latency more aggressively but also amplify RCU consumption on every slow request. |
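The recommended p95 threshold can be derived directly from observed request latencies; a sketch (the latency sample is synthetic):

```python
import statistics

def racing_wait_time_ms(latencies_ms: list[float]) -> float:
    """Pick DYNAMODB_REQUEST_RACING_WAIT_TIME as the observed p95.

    At this threshold only ~5% of requests run long enough to be
    duplicated, bounding the extra RCU cost at roughly 5%.
    """
    # quantiles(n=20) yields the 5th..95th percentiles; the last is p95.
    return statistics.quantiles(latencies_ms, n=20)[-1]

# Synthetic sample: mostly fast requests with a slow tail.
sample = [5.0] * 90 + [8.0] * 5 + [40.0, 45.0, 50.0, 55.0, 60.0]
wait = racing_wait_time_ms(sample)
duplicated = sum(1 for x in sample if x > wait) / len(sample)
print(round(wait, 1), duplicated)
```

With this sample, only the five slowest requests exceed the computed wait time, so about 5% of requests would be raced, matching the cost bound described above.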
| Name | Default | Description |
|---|---|---|
| `DYNAMODB_CHECK_TS_FOR_BULK_WRITES` | true | When true, bulk writes use DynamoDB transactional updates that skip the write if the existing observed-at timestamp is newer than the incoming value. Prevents stale data from overwriting fresher data when writers race. Transactional writes cost 2x WCU; set to false if your pipeline already guarantees monotonic write ordering. |
| `DYNAMODB_ONLY_WRITE_NEWER_VALUES` | true | Conditional-update guard for the non-bulk write path. When true, the per-item update expression compares observed-at timestamps and skips the write if the existing value is newer. Disable only if you are certain that all writers issue strictly monotonic timestamps. |
| `DYNAMODB_TRANSACTION_WRITE_CONFLICT_MIN_RETRY_MILLIS` | 50 | Initial backoff (milliseconds) when a transactional write fails due to a `TransactionConflictException`. Subsequent retries scale this value with jitter. |
| `DYNAMODB_TRANSACTION_WRITE_CONFLICT_MAX_RETRIES` | 5 | Maximum number of retries on a transactional write conflict before surfacing the error. Increase if your workload has high contention on the same key (e.g. many writers updating the same entity). |
| `DYNAMODB_AGGREGATE_UPDATE_CACHE_SIZE` | 256 | In-memory cache size, in entries, for materialized aggregation buckets used to speed up updates to non-trivial aggregations such as approx-count-distinct. Tune to roughly match the number of frequently updating buckets at any given time. Monitor with the `chalk.libdynamo.num_update_cache_*` metrics. |
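The observed-at timestamp guard behaves like a standard DynamoDB conditional update. A sketch of an equivalent `UpdateItem` request (the attribute names `val` and `ts` are assumptions for illustration, not Chalk's actual schema):

```python
def build_guarded_update(pk: str, sk: str, value: bytes,
                         observed_at_ms: int) -> dict:
    """Build an UpdateItem request body that skips the write when the
    stored observed-at timestamp is newer than the incoming one."""
    return {
        "Key": {"pk": {"S": pk}, "sk": {"S": sk}},
        "UpdateExpression": "SET #val = :v, #ts = :ts",
        # Write only if the item is absent or our timestamp is newer.
        "ConditionExpression": "attribute_not_exists(#ts) OR #ts < :ts",
        "ExpressionAttributeNames": {"#val": "val", "#ts": "ts"},
        "ExpressionAttributeValues": {
            ":v": {"B": value},
            ":ts": {"N": str(observed_at_ms)},
        },
    }

req = build_guarded_update("user:123", "features", b"\x00", 1700000000000)
print(req["ConditionExpression"])
# → attribute_not_exists(#ts) OR #ts < :ts
```

When the condition fails, DynamoDB rejects the write with a `ConditionalCheckFailedException`, which is the "skip" behavior the table describes: the fresher stored value wins.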
| Name | Default | Description |
|---|---|---|
| `DYNAMODB_MAX_RETRIES` | 12 | Maximum number of retries on retryable DynamoDB errors (throttling, transient 5xx). Combined with `DYNAMODB_RETRY_SCALE_FACTOR`, this controls how aggressively the client absorbs throttling. |
| `DYNAMODB_RETRY_SCALE_FACTOR` | 10 | Multiplier applied to exponential-backoff delays. Higher values smooth out throughput during sustained throttling at the cost of higher per-request latency. |
| `DYNAMODB_REQUEST_TIMEOUT_MS` | None | Per-request timeout in milliseconds. When unset, the AWS SDK default is used. Set this if you would rather fail fast than wait on a slow region. |
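To see how the two retry knobs interact, consider a classic exponential backoff with full jitter (the client's exact schedule is internal; this sketch only illustrates the shape, and the 20-second cap is an assumption):

```python
import random

def backoff_schedule_ms(max_retries: int = 12,   # DYNAMODB_MAX_RETRIES
                        scale_factor: int = 10,  # DYNAMODB_RETRY_SCALE_FACTOR
                        cap_ms: float = 20_000.0,
                        seed: int = 0) -> list[float]:
    """Exponential backoff with full jitter: the k-th retry sleeps a
    uniform amount up to scale_factor * 2**k milliseconds, capped."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap_ms, scale_factor * 2 ** k))
            for k in range(max_retries)]

delays = backoff_schedule_ms()
# Raising scale_factor stretches every sleep proportionally; raising
# max_retries extends how long the client keeps absorbing throttling.
print(len(delays), round(sum(delays) / 1000, 1))  # retry count, total seconds
```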
| Name | Default | Description |
|---|---|---|
| `DYNAMODB_WARMUP_FQN_MAPPING` | true | Chalk stores a short stable identifier for each fully-qualified feature name in DynamoDB to keep items small. When true, the engine pre-loads the entire FQN→short-name mapping at startup. When false, mappings are computed and cached lazily on first use of each feature. |
| `DYNAMODB_CREATE_TABLES_IF_NOT_EXISTS` | false | When true, the engine will attempt to create the DynamoDB table at startup if it does not already exist. Off by default because production tables should be provisioned via Terraform (see above) so that capacity, replication, and IAM are managed with the rest of the customer’s infrastructure. |