Infrastructure
Deploy Amazon MSK as Chalk's asynchronous persistence queue in a single region or with multi-region replication.
Chalk uses Amazon MSK (Managed Streaming for Apache Kafka) as the asynchronous persistence queue between the query servers and the background persistence writers. Specifically, MSK serves two roles: it queues query results for asynchronous persistence by the background writers, and it transports the operational metrics that Chalk emits.
Crucially, MSK is not on the synchronous online query path. Online query results are served directly from the query servers and the online store — we do not cache query responses in MSK. As a result, MSK is sized for throughput and durability, not for latency.
MSK capacity for Chalk is overwhelmingly driven by storage, not broker compute. For the vast majority of production Chalk deployments, the following is sufficient:

Three brokers on a modest instance class (kafka.m7g.large or equivalent). MSK brokers do not need to be large — the message volume Chalk produces is modest relative to what Kafka can handle.

Storage is typically the dominant cost and scaling dimension. If MSK costs are becoming meaningful, the primary lever is message retention: Chalk’s writers consume messages quickly under normal operation, so retention exists mainly to tolerate writer outages and provide a replay window for debugging. Shortening retention (for example, from 7 days to 24 hours) directly reduces storage needs.
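The effect of the retention lever can be checked with a back-of-envelope calculation. The sketch below uses illustrative inputs (message rate and message size are assumptions, not measurements from any particular Chalk deployment), together with the 3x replication factor from the cluster configuration:

```python
# Back-of-envelope MSK storage sizing. Message rate and size are
# illustrative assumptions, not figures from a real deployment.

def msk_storage_gb(msgs_per_sec: float, avg_msg_bytes: float,
                   retention_hours: float, replication_factor: int = 3) -> float:
    """Approximate total cluster storage (GiB) for one retention window."""
    raw_bytes = msgs_per_sec * avg_msg_bytes * retention_hours * 3600
    return raw_bytes * replication_factor / 1024**3

# 200 msg/s at ~1 KiB each, 7-day retention, 3x replication:
# roughly 346 GiB across the whole cluster.
week = msk_storage_gb(200, 1024, 168)

# The same workload with 24-hour retention needs one seventh the storage.
day = msk_storage_gb(200, 1024, 24)
```

Storage scales linearly with the retention window, which is why shortening retention is the first lever to pull when MSK cost becomes meaningful.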
Broker compute needs rarely require attention. If CPU on the brokers is actually saturated, the usual culprit is a writer lag problem — contact Chalk Support before resizing brokers.
Chalk will assist with initial sizing and any later capacity questions, but the customer is ultimately responsible for provisioning and operating the MSK cluster. This is intentional: MSK is a direct cost driver and is part of the customer’s cloud account, and its availability configuration (single-region vs. multi-region, retention, broker class) reflects the customer’s own availability targets and cost model.
Chalk’s responsibilities are: assisting with initial cluster sizing, and advising on capacity questions (retention, storage, broker class) as the workload evolves.

Customer responsibilities are: provisioning and operating the MSK cluster in their own cloud account, and choosing its availability configuration (single-region vs. multi-region, retention, broker class) to match their availability targets and cost model.
A single-region MSK cluster is appropriate for the majority of production deployments. If the region (or the MSK cluster) becomes unavailable, the effect on Chalk is limited to write paths and does not bring down online query serving: online queries continue to be answered from the query servers and the online store, while metrics emission and background persistence pause until the cluster recovers.
Single-region is therefore a reasonable choice if you can tolerate write unavailability for metrics and background persistence during a region outage. If the workload writing to MSK is primarily metrics and query result persistence — not load-bearing for online query availability — single-region is often the right tradeoff.
Chalk supports multi-region MSK deployments using MSK Replicator for asynchronous replication between clusters. The pattern is a primary MSK cluster in the active region and a secondary MSK cluster in a standby region, with MSK Replicator continuously copying topics from primary to secondary.
Asynchronous replication achieves RPO < 1 minute: under normal conditions the replicator lag is a small number of seconds, and even under stress it has historically stayed well below a minute. Failover from primary to secondary is a zero-downtime operation: the secondary cluster is already running and caught up, and the Chalk producers and consumers simply reconfigure to the secondary bootstrap servers.
The tradeoff of async replication is that the last few seconds of writes at the moment of failover can be lost. For Chalk’s use of MSK — metrics and persisted query results — this is an acceptable loss: it translates to a small gap in dashboards and a small number of cache entries that may need to be re-computed on subsequent queries. It does not corrupt feature values or affect online query correctness.
To guarantee RPO = 0 we would need synchronous cross-region replication, meaning each produce would not acknowledge until the message was durably committed in both regions. This adds one inter-region round trip to every produce (typically 20-80ms) and roughly doubles MSK cost, which is not an acceptable tradeoff for an asynchronous persistence queue.
A single-region MSK cluster sized for a typical production Chalk deployment:

resource "aws_msk_configuration" "chalk" {
  name           = "chalk-msk-config"
  kafka_versions = ["3.6.0"]

  server_properties = <<PROPERTIES
auto.create.topics.enable=false
log.retention.hours=168
min.insync.replicas=2
default.replication.factor=3
num.partitions=12
PROPERTIES
}

resource "aws_msk_cluster" "chalk" {
  cluster_name           = "chalk-msk"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m7g.large"
    client_subnets  = var.private_subnet_ids
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 500
      }
    }
  }

  configuration_info {
    arn      = aws_msk_configuration.chalk.arn
    revision = aws_msk_configuration.chalk.latest_revision
  }

  client_authentication {
    sasl {
      scram = true
    }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }

  tags = {
    chalk_environment = "production"
  }
}

For most production workloads this configuration carries substantial headroom. Increase volume_size if you lengthen retention, and increase instance_type only if Chalk Support has identified a genuine broker bottleneck.
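If you do shorten retention as discussed above, the only change is the log.retention.hours line in the server properties; a sketch of the same aws_msk_configuration resource with a 24-hour window (volume_size on the brokers can then shrink proportionally):

```terraform
resource "aws_msk_configuration" "chalk" {
  name           = "chalk-msk-config"
  kafka_versions = ["3.6.0"]

  server_properties = <<PROPERTIES
auto.create.topics.enable=false
log.retention.hours=24
min.insync.replicas=2
default.replication.factor=3
num.partitions=12
PROPERTIES
}
```

A 24-hour window still covers the common failure mode (a writer outage of a few hours) while cutting the storage footprint to one seventh of the 7-day default.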
A multi-region deployment provisions an MSK cluster in each region and an aws_msk_replicator
resource that asynchronously replicates topics from the primary cluster to the secondary:
resource "aws_msk_cluster" "chalk_primary" {
  provider = aws.us_east_1

  cluster_name           = "chalk-msk-primary"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m7g.large"
    client_subnets  = var.primary_private_subnet_ids
    security_groups = [aws_security_group.msk_primary.id]

    storage_info {
      ebs_storage_info {
        volume_size = 500
      }
    }
  }

  client_authentication {
    sasl { scram = true }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }
}

resource "aws_msk_cluster" "chalk_secondary" {
  provider = aws.us_west_2

  cluster_name           = "chalk-msk-secondary"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m7g.large"
    client_subnets  = var.secondary_private_subnet_ids
    security_groups = [aws_security_group.msk_secondary.id]

    storage_info {
      ebs_storage_info {
        volume_size = 500
      }
    }
  }

  client_authentication {
    sasl { scram = true }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }
}

resource "aws_msk_replicator" "chalk" {
  provider = aws.us_east_1

  replicator_name            = "chalk-msk-replicator"
  service_execution_role_arn = aws_iam_role.msk_replicator.arn

  kafka_cluster {
    amazon_msk_cluster {
      msk_cluster_arn = aws_msk_cluster.chalk_primary.arn
    }
    vpc_config {
      subnet_ids          = var.primary_private_subnet_ids
      security_groups_ids = [aws_security_group.msk_primary.id]
    }
  }

  kafka_cluster {
    amazon_msk_cluster {
      msk_cluster_arn = aws_msk_cluster.chalk_secondary.arn
    }
    vpc_config {
      subnet_ids          = var.secondary_private_subnet_ids
      security_groups_ids = [aws_security_group.msk_secondary.id]
    }
  }

  replication_info_list {
    source_kafka_cluster_arn = aws_msk_cluster.chalk_primary.arn
    target_kafka_cluster_arn = aws_msk_cluster.chalk_secondary.arn
    target_compression_type  = "NONE"

    topic_replication {
      topics_to_replicate                  = [".*"]
      detect_and_copy_new_topics           = true
      copy_topic_configurations            = true
      copy_access_control_lists_for_topics = true
    }

    consumer_group_replication {
      consumer_groups_to_replicate = [".*"]
    }
  }
}

Failover is a reconfiguration of the Chalk producers and consumers to point at the secondary cluster’s bootstrap servers; this is handled alongside the Chalk-level Multi-Region Failover procedure.
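To make that repoint mechanical, the standby cluster's bootstrap servers can be exported as a Terraform output; a sketch, using the bootstrap_brokers_sasl_scram attribute that corresponds to the SASL/SCRAM client authentication configured above:

```terraform
# Bootstrap servers for the standby cluster. During failover, the Chalk
# producers and consumers are reconfigured to this connection string.
output "chalk_secondary_bootstrap_brokers" {
  description = "SASL/SCRAM bootstrap servers for the secondary MSK cluster"
  value       = aws_msk_cluster.chalk_secondary.bootstrap_brokers_sasl_scram
}
```

Keeping this value exported ahead of time means the failover runbook reads a known output rather than querying AWS for the connection string mid-incident.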