Infrastructure
Deploy Amazon MSK as Chalk's asynchronous persistence queue in a single region or with multi-region replication.
Chalk uses Amazon MSK (Managed Streaming for Apache Kafka) as the asynchronous persistence queue between the query servers and the background persistence writers. Specifically, MSK serves two roles: it queues query results for asynchronous persistence by the background writers, and it transports the operational metrics that Chalk emits.
Crucially, MSK is not on the synchronous online query path. Online query results are served directly from the query servers and the online store — we do not cache query responses in MSK. As a result, MSK is sized for throughput and durability, not for latency.
MSK capacity for Chalk is overwhelmingly driven by storage, not broker compute. For the vast majority of production Chalk deployments, the following is sufficient:

Three brokers on a modest instance class (kafka.m7g.large or equivalent). MSK brokers do not need to be large — the message volume Chalk produces is modest relative to what Kafka can handle.

Storage is typically the dominant cost and scaling dimension. If MSK costs are becoming meaningful, the primary lever is message retention: Chalk’s writers consume messages quickly under normal operation, so retention exists mainly to tolerate writer outages and provide a replay window for debugging. Shortening retention (for example, from 7 days to 24 hours) directly reduces storage needs.
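The effect of the retention lever can be checked with a back-of-envelope calculation. The sketch below uses illustrative inputs (message rate and message size are assumptions, not measurements from any particular Chalk deployment), together with the 3x replication factor from the cluster configuration:

```python
# Back-of-envelope MSK storage sizing. Message rate and size are
# illustrative assumptions, not figures from a real deployment.

def msk_storage_gb(msgs_per_sec: float, avg_msg_bytes: float,
                   retention_hours: float, replication_factor: int = 3) -> float:
    """Approximate total cluster storage (GiB) for one retention window."""
    raw_bytes = msgs_per_sec * avg_msg_bytes * retention_hours * 3600
    return raw_bytes * replication_factor / 1024**3

# 200 msg/s at ~1 KiB each, 7-day retention, 3x replication:
# roughly 346 GiB across the whole cluster.
week = msk_storage_gb(200, 1024, 168)

# The same workload with 24-hour retention needs one seventh the storage.
day = msk_storage_gb(200, 1024, 24)
```

Storage scales linearly with the retention window, which is why shortening retention is the first lever to pull when MSK cost becomes meaningful.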
Broker compute needs rarely require attention. If CPU on the brokers is actually saturated, the usual culprit is a writer lag problem — contact Chalk Support before resizing brokers.
Chalk will assist with initial sizing and any later capacity questions, but the customer is ultimately responsible for provisioning and operating the MSK cluster. This is intentional: MSK is a direct cost driver and is part of the customer’s cloud account, and its availability configuration (single-region vs. multi-region, retention, broker class) reflects the customer’s own availability targets and cost model.
Chalk’s responsibilities are: assisting with initial cluster sizing, and advising on capacity questions (retention, storage, broker class) as the workload evolves.

Customer responsibilities are: provisioning and operating the MSK cluster in their own cloud account, and choosing its availability configuration (single-region vs. multi-region, retention, broker class) to match their availability targets and cost model.
A single-region MSK cluster is appropriate for the majority of production deployments. If the region (or the MSK cluster) becomes unavailable, the effect on Chalk is limited to write paths and does not bring down online query serving: online queries continue to be answered from the query servers and the online store, while metrics emission and background persistence pause until the cluster recovers.
Single-region is therefore a reasonable choice if you can tolerate write unavailability for metrics and background persistence during a region outage. If the workload writing to MSK is primarily metrics and query result persistence — not load-bearing for online query availability — single-region is often the right tradeoff.
Chalk supports multi-region MSK deployments using MSK Replicator for asynchronous replication between clusters. The pattern is a primary MSK cluster in the active region and a secondary MSK cluster in a standby region, with MSK Replicator continuously copying topics from primary to secondary.
Asynchronous replication achieves RPO < 1 minute: under normal conditions the replicator lag is a small number of seconds, and even under stress it has historically stayed well below a minute. Failover from primary to secondary is a zero-downtime operation: the secondary cluster is already running and caught up, and the Chalk producers and consumers simply reconfigure to the secondary bootstrap servers.
The tradeoff of async replication is that the last few seconds of writes at the moment of failover can be lost. For Chalk’s use of MSK — metrics and persisted query results — this is an acceptable loss: it translates to a small gap in dashboards and a small number of cache entries that may need to be re-computed on subsequent queries. It does not corrupt feature values or affect online query correctness.
To guarantee RPO = 0 we would need synchronous cross-region replication, meaning each produce would not acknowledge until the message was durably committed in both regions. This adds one inter-region round trip to every produce (typically 20-80ms) and roughly doubles MSK cost, which is not an acceptable tradeoff for an asynchronous persistence queue.
A single-region MSK cluster sized for a typical production Chalk deployment:

resource "aws_msk_configuration" "chalk" {
  name           = "chalk-msk-config"
  kafka_versions = ["3.6.0"]

  server_properties = <<PROPERTIES
auto.create.topics.enable=false
log.retention.hours=168
min.insync.replicas=2
default.replication.factor=3
num.partitions=12
PROPERTIES
}

resource "aws_msk_cluster" "chalk" {
  cluster_name           = "chalk-msk"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m7g.large"
    client_subnets  = var.private_subnet_ids
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 500
      }
    }
  }

  configuration_info {
    arn      = aws_msk_configuration.chalk.arn
    revision = aws_msk_configuration.chalk.latest_revision
  }

  client_authentication {
    sasl {
      scram = true
    }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }

  tags = {
    chalk_environment = "production"
  }
}

For most production workloads this configuration carries substantial headroom. Increase volume_size if you lengthen retention, and increase instance_type only if Chalk Support has identified a genuine broker bottleneck.
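If you do shorten retention as discussed above, the only change is the log.retention.hours line in the server properties; a sketch of the same aws_msk_configuration resource with a 24-hour window (volume_size on the brokers can then shrink proportionally):

```terraform
resource "aws_msk_configuration" "chalk" {
  name           = "chalk-msk-config"
  kafka_versions = ["3.6.0"]

  server_properties = <<PROPERTIES
auto.create.topics.enable=false
log.retention.hours=24
min.insync.replicas=2
default.replication.factor=3
num.partitions=12
PROPERTIES
}
```

A 24-hour window still covers the common failure mode (a writer outage of a few hours) while cutting the storage footprint to one seventh of the 7-day default.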
A multi-region deployment provisions an MSK cluster in each region and an aws_msk_replicator
resource that asynchronously replicates topics from the primary cluster to the secondary:
resource "aws_msk_cluster" "chalk_primary" {
  provider = aws.us_east_1

  cluster_name           = "chalk-msk-primary"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m7g.large"
    client_subnets  = var.primary_private_subnet_ids
    security_groups = [aws_security_group.msk_primary.id]

    storage_info {
      ebs_storage_info {
        volume_size = 500
      }
    }
  }

  client_authentication {
    sasl { scram = true }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }
}

resource "aws_msk_cluster" "chalk_secondary" {
  provider = aws.us_west_2

  cluster_name           = "chalk-msk-secondary"
  kafka_version          = "3.6.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m7g.large"
    client_subnets  = var.secondary_private_subnet_ids
    security_groups = [aws_security_group.msk_secondary.id]

    storage_info {
      ebs_storage_info {
        volume_size = 500
      }
    }
  }

  client_authentication {
    sasl { scram = true }
  }

  encryption_info {
    encryption_in_transit {
      client_broker = "TLS"
      in_cluster    = true
    }
  }
}

resource "aws_msk_replicator" "chalk" {
  provider = aws.us_east_1

  replicator_name            = "chalk-msk-replicator"
  service_execution_role_arn = aws_iam_role.msk_replicator.arn

  kafka_cluster {
    amazon_msk_cluster {
      msk_cluster_arn = aws_msk_cluster.chalk_primary.arn
    }
    vpc_config {
      subnet_ids          = var.primary_private_subnet_ids
      security_groups_ids = [aws_security_group.msk_primary.id]
    }
  }

  kafka_cluster {
    amazon_msk_cluster {
      msk_cluster_arn = aws_msk_cluster.chalk_secondary.arn
    }
    vpc_config {
      subnet_ids          = var.secondary_private_subnet_ids
      security_groups_ids = [aws_security_group.msk_secondary.id]
    }
  }

  replication_info_list {
    source_kafka_cluster_arn = aws_msk_cluster.chalk_primary.arn
    target_kafka_cluster_arn = aws_msk_cluster.chalk_secondary.arn
    target_compression_type  = "NONE"

    topic_replication {
      topics_to_replicate                  = [".*"]
      detect_and_copy_new_topics           = true
      copy_topic_configurations            = true
      copy_access_control_lists_for_topics = true
    }

    consumer_group_replication {
      consumer_groups_to_replicate = [".*"]
    }
  }
}

Failover is a reconfiguration of the Chalk producers and consumers to point at the secondary cluster’s bootstrap servers; this is handled alongside the Chalk-level Multi-Region Failover procedure.
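To make that repoint mechanical, the standby cluster's bootstrap servers can be exported as a Terraform output; a sketch, using the bootstrap_brokers_sasl_scram attribute that corresponds to the SASL/SCRAM client authentication configured above:

```terraform
# Bootstrap servers for the standby cluster. During failover, the Chalk
# producers and consumers are reconfigured to this connection string.
output "chalk_secondary_bootstrap_brokers" {
  description = "SASL/SCRAM bootstrap servers for the secondary MSK cluster"
  value       = aws_msk_cluster.chalk_secondary.bootstrap_brokers_sasl_scram
}
```

Keeping this value exported ahead of time means the failover runbook reads a known output rather than querying AWS for the connection string mid-incident.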