Chalk uses background persistence writers to asynchronously persist query results to the online store, offline store, and metrics store. When these writers encounter issues, features may appear stale, offline queries may return incomplete data, or online store reads may return unexpected results. This guide walks through how to identify and resolve common persistence problems.


How persistence works

Understanding the data flow is essential for debugging. When a query executes on the engine, results flow through a message bus before reaching their final storage destinations:

  1. The query server (engine) computes feature values and publishes them to the result bus (a Pub/Sub topic on GCP or an MSK/Kafka topic on AWS).
  2. Background persistence writers consume messages from the result bus and write to their respective stores:
    • The online writer persists values to the online store (ElastiCache, DynamoDB, Redis, etc.) for low-latency reads.
    • The offline writer reads from the result bus, creates per-feature table schemas in the offline store if they don’t already exist, transforms query results into parquet-encoded messages, and publishes them to the offline store streaming insert topic. It does not insert data into the database directly.
    • The offline store streaming insert writer reads from the streaming insert topic and performs the actual database writes to per-feature observation tables and the query_log table in the offline store. Note that other components (such as the query engine) may also publish messages directly to the streaming insert topic, so this writer handles more than just the output of the offline writer.
    • The offline store bulk insert writer handles writes that originate as parquet files in cloud storage (S3 or GCS) rather than as messages on a bus. When components like the offline query engine run with store_offline=True, they write result parquet files to a cloud storage bucket and publish a notification to a bulk upload topic. The bulk insert writer picks up these notifications and loads the files into the offline store using COPY INTO operations. This writer is required for non-BigQuery offline store backends (Snowflake, Redshift, Databricks); for BigQuery, dedicated BigQuery streaming writers handle this role instead. It requires a configured upload bucket (BQ_UPLOAD_BUCKET) with the correct cloud prefix (s3:// for AWS, gs:// for GCP).
    • The metrics writer transforms results into metrics format and publishes them to a metrics bus, where the metrics bus writer persists them to the metrics store for observability.
  3. Each writer runs as an independent Kubernetes deployment, autoscaled by KEDA based on queue depth.

The offline store write path is a two-stage pipeline: the offline writer creates table schemas, transforms result bus messages, and republishes them to the streaming insert topic, then the streaming insert writer writes the data to the database. Both must be deployed for offline persistence to function. The bulk insert writer is independent of this pipeline — it handles loading parquet files from cloud storage into the offline store, which is how offline query results (e.g. from store_offline=True) are persisted. Which offline writers you need depends on your offline store backend: for BigQuery, use dedicated BigQuery streaming writers; for all other backends (Snowflake, Redshift, Databricks), use the bulk insert writer and streaming insert writer.
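
To make the bulk path concrete, here is a minimal sketch of triggering it from the Python client, assuming your client version exposes the store_offline flag referenced above (the flag name and placement may differ by version):

```python
from chalk.client import ChalkClient

client = ChalkClient()

# With store_offline=True, the offline query engine writes result parquet
# files to the configured upload bucket (BQ_UPLOAD_BUCKET) and publishes a
# notification to the bulk upload topic; the bulk insert writer then loads
# those files into the offline store.
dataset = client.offline_query(
    input={"user.id": [1, 2, 3]},
    output=["user.credit_score"],  # hypothetical feature name
    store_offline=True,            # assumption: flag per your client version
)
```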

A failure at any stage of this pipeline can cause data to stop flowing to one or more stores.


Triage flowchart

When you suspect a persistence issue, follow this decision tree:

1. Identify the symptom

Start by determining which store is affected:

  • Online store returning stale or null values — the online writer may be down or behind.
  • Offline queries returning incomplete data — either the offline writer or the streaming insert writer may be down or behind.
  • Missing metrics in the dashboard — the metrics writer may be affected.
  • All stores affected — the result bus itself may be clogged, or the query server may not be publishing.

2. Check Kubernetes pod status

Navigate to Infrastructure > Kube Events in the Chalk dashboard and filter by the background-persistence namespace. If the background-persistence namespace does not exist, the writers may never have been provisioned; create them in the dashboard under Settings > Shared Resources > Background Persistence.

Look for pods in an unhealthy state:

  • OOMKilled — the writer is running out of memory. See Raising memory requests below.
  • CrashLoopBackOff — the writer is repeatedly crashing. Check logs for the root cause.
  • ContainerCreating (stuck for more than a few minutes) — there may be a node scheduling issue or resource shortage.
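
If you prefer to check pod status programmatically, here is a minimal sketch using the official Kubernetes Python client (assuming you have cluster credentials; the background-persistence namespace is as above):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# List every writer pod and surface its phase plus any abnormal waiting or
# termination reason (e.g. CrashLoopBackOff, ContainerCreating, OOMKilled).
for pod in v1.list_namespaced_pod("background-persistence").items:
    reasons = []
    for cs in pod.status.container_statuses or []:
        if cs.state.waiting:  # e.g. CrashLoopBackOff, ContainerCreating
            reasons.append(cs.state.waiting.reason)
        if cs.last_state.terminated:  # e.g. OOMKilled
            reasons.append(cs.last_state.terminated.reason)
    print(pod.metadata.name, pod.status.phase, reasons)
```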

3. Check the result bus

If writers appear healthy but data is not flowing, the result bus itself may be the bottleneck. Navigate to Infrastructure > Kubernetes, expand the Background persistence writers drop-down, and click on a rust-result-bus-online-writer or result-bus-offline-writer pod to view its logs. Filter the logs by component:"background-persistence" and look for consumer lag or error messages.
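
The same logs can be fetched programmatically; a sketch with the Kubernetes Python client (the pod name is hypothetical — substitute one from your cluster):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Tail recent logs from a writer pod and surface lag or error lines.
logs = v1.read_namespaced_pod_log(
    name="result-bus-offline-writer-abc123",  # hypothetical pod name
    namespace="background-persistence",
    tail_lines=500,
)
for line in logs.splitlines():
    if "lag" in line.lower() or "error" in line.lower():
        print(line)
```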

Signs of a clogged result bus include:

  • Consumer lag increasing over time (messages are being produced faster than consumed).
  • Writers logging errors when attempting to deserialize or process messages.
  • Writers processing messages but failing to write to the downstream store (e.g. connection errors to Redis or BigQuery).

Common issues and solutions

OOMKilled writers

This is the most common persistence issue. When a writer’s memory usage exceeds its Kubernetes memory limit, the pod is terminated with an OOMKilled status. While Kubernetes will restart the pod, repeated OOM kills cause the writer to fall behind, creating a backlog on the result bus.

Symptoms: pods cycling between Running and OOMKilled, increasing consumer lag on the result bus, stale or missing feature values.

Solution: increase the memory request and limit for the affected writer. See Raising memory requests.

Result bus backlog

If writers were down for a period (due to OOM kills, crashes, or scaling issues), a backlog of unprocessed messages may accumulate on the result bus. Once writers are healthy again, they will work through the backlog, but this can take time depending on the volume.

Symptoms: writers are healthy and running, but data is still delayed. Consumer lag is high but decreasing.

Solution: wait for the backlog to drain. If you need to accelerate processing, you can temporarily increase the replica count for the affected writer in the background persistence configuration.
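
If you administer the cluster directly, a manual replica bump looks roughly like the sketch below (the deployment name is hypothetical). Note that KEDA autoscales these deployments and may revert manual changes, so the dashboard configuration is the durable fix:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Temporarily raise the replica count so more consumers work the backlog.
# KEDA's autoscaler may override this; prefer the dashboard setting.
apps.patch_namespaced_deployment(
    name="result-bus-offline-writer",  # hypothetical deployment name
    namespace="background-persistence",
    body={"spec": {"replicas": 4}},
)
```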

Writer crash loops

If a writer is in CrashLoopBackOff, it is repeatedly failing to start.

Symptoms: pod restarts with increasing backoff intervals.

Solution: check the pod logs in Infrastructure > Kubernetes by expanding the Background Persistence drop-down to find the root cause.
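
A crash-looping container’s current logs are often empty; fetch the previous container’s logs instead, which usually contain the fatal error. A sketch (pod name hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# previous=True returns logs from the last terminated container,
# which usually contains the stack trace that caused the crash.
print(v1.read_namespaced_pod_log(
    name="result-bus-offline-writer-abc123",  # hypothetical pod name
    namespace="background-persistence",
    previous=True,
))
```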

Features not persisting to offline store

If online queries return correct values but offline data is missing or stale:

  • Verify that both the offline writer and the offline store streaming insert writer are deployed and healthy (see the deployment-listing sketch after this list). A common issue is having the offline writer configured but missing the streaming insert writer (or vice versa). Check Settings > Shared Resources > Background Persistence to confirm both are present.
  • Check the Health Metrics dropdown on the Background Persistence page — if the queues associated with offline writers are growing, the writers may need more replicas or resources.
  • Check that skip_offline is not set to True on the relevant stream resolvers.
  • Ensure that max_staleness is configured appropriately.
  • If offline query results (from store_offline=True) are not appearing in the offline store, the bulk insert writer may be misconfigured or missing. This writer is required for non-BigQuery backends. Verify that BQ_UPLOAD_BUCKET is set to a valid cloud storage path with the correct prefix (s3:// for AWS, gs:// for GCP), and that the associated storage integration has the correct permissions.
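
To confirm both stages of the offline pipeline are deployed, as noted in the first item above, you can list the writer deployments. A sketch — the name substrings are illustrative and should be matched to your cluster’s actual deployment names:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

names = [d.metadata.name
         for d in apps.list_namespaced_deployment("background-persistence").items]
print(names)

# Both stages must be present for offline persistence to work.
# Substrings below are illustrative; match your actual deployment names.
assert any("offline-writer" in n for n in names), "offline writer missing"
assert any("streaming-insert" in n for n in names), "streaming insert writer missing"
```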

Writer stuck in ContainerCreating

If a writer pod is stuck in ContainerCreating for an extended period, it is typically unable to mount a required volume or secret.

Symptoms: pod remains in ContainerCreating state for more than a few minutes. Kube Events may show FailedMount errors.

Solution: check the pod logs in Infrastructure > Kubernetes by expanding the Background Persistence drop-down for a possible cause.

Features not persisting to online store

If online queries unexpectedly get cache misses:

  • Verify that the online writer is deployed and healthy.
  • Confirm the features queried are configured with a max_staleness.
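
As a reminder of what enables online caching, a minimal sketch of a feature with max_staleness set (class and feature names are illustrative):

```python
from chalk.features import features, feature

@features
class User:
    id: int
    # Without max_staleness, computed values are never cached in the
    # online store, so every online query recomputes them (a cache miss).
    credit_score: float = feature(max_staleness="30m")
```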

Raising memory requests

Background persistence writers run as Kubernetes deployments with configurable CPU and memory requests and limits. When a writer is OOMKilled, you need to increase its memory allocation.

Via the Chalk dashboard UI

  1. Navigate to Settings > Shared Resources > Background Persistence.
  2. Select the writer whose memory you want to increase.
  3. Increase the memory value in both the request and limit fields.
  4. Click Save and Apply.
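
Under the hood this updates the writer’s Kubernetes deployment resources. If you administer the cluster directly, the equivalent change looks roughly like the sketch below (deployment and container names are hypothetical; the dashboard remains the supported path and may overwrite manual edits):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Raise both the request and the limit so the scheduler reserves enough
# memory and the container is not OOMKilled at the old ceiling.
apps.patch_namespaced_deployment(
    name="result-bus-offline-writer",  # hypothetical deployment name
    namespace="background-persistence",
    body={"spec": {"template": {"spec": {"containers": [{
        "name": "writer",  # hypothetical container name
        "resources": {
            "requests": {"memory": "2Gi"},
            "limits": {"memory": "2Gi"},
        },
    }]}}}},
)
```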

Additional resources