Chalk Dataplane Installation (Background Persistence)

Chalk uses background writers hosted in your ("Customer Cloud") Kubernetes cluster to write information about queries to various storage locations.

Prerequisites

To install the Chalk persistence writers, you need the following:

  • A namespace to deploy the writers into

If using Kafka:

  • Kafka brokers
  • Kafka topics for each bus
  • Kafka authentication secret stored in AWS Secrets Manager

If using Pub/Sub:

  • Topics for each bus
  • Subscriptions for each bus

Creating Background Persistence Writers

Navigate to the Settings/Team/Shared Resources/Background Persistence page in the Chalk UI to view the background persistence configuration. If no background persistence is configured, you will see a message indicating that none is currently present; the first save and apply will create the background persistence writers.

Writer Types

Chalk supports different types of background persistence writers, each designed for specific data flow and storage purposes:

Online Store Writers

  • ONLINE_WRITER: Listens to the online store subscription and writes query results into the online store (Redis, DynamoDB, etc.). This ensures that computed feature values are persisted for fast retrieval during online inference.

Metrics and Monitoring Writers

  • METRICS_WRITER: Listens to the result metrics subscription and republishes messages as metrics bus messages. This writer transforms query result data into metrics format for downstream processing.
  • METRICS_BUS_WRITER: Listens to the metrics bus subscription and writes processed metrics to the metrics store. This enables monitoring and observability of feature computation performance.

Offline Store Writers

  • OFFLINE_WRITER: Listens to the offline store subscription and writes query results into the offline store (BigQuery, Snowflake, etc.). This provides historical data for training and batch inference.
  • OFFLINE_STORE_BULK_INSERT_WRITER: Listens to the bulk upload subscription and performs efficient bulk inserts of parquet files from cloud storage (GCS/S3) into the offline store using COPY INTO operations.
  • OFFLINE_STORE_STREAMING_INSERT_WRITER: Listens to the streaming write subscription and handles real-time data insertion. The message body contains parquet-encoded data with table metadata. While the subscription name references BigQuery, this writer supports all offline store backends. For BigQuery, the more efficient bigquery-streaming-write-loader is typically used instead.

Specialized Writers

  • USAGE_BUS_WRITER: Listens to the usage events subscription and writes usage data to the usage events store (typically BigQuery). This tracks feature computation costs and usage patterns.
  • QUERY_MIRROR_CONSUMER: Listens to the query log subscription and executes mirrored queries. This is useful for debugging, testing, and validation scenarios.
  • ONLINE_VECTOR_STORE_WRITER: Listens to the vector store subscription and sends batches to vector databases for similarity search and retrieval use cases.

Each writer type requires specific subscription IDs and topics to be configured in the common persistence specifications.
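
For example, the metrics bus writer consumes the metrics bus subscription, so the corresponding subscription and topic IDs must appear in the common persistence specifications. The fragment below is a minimal illustration drawn from the example configuration at the bottom of this page; the names and IDs are placeholders:

{
  "common_persistence_specs": {
    "metrics_bus_subscription_id": "metrics-bus-1",
    "metrics_bus_topic_id": "metrics-bus-1"
  },
  "writers": [
    {
      "name": "go-metrics-bus-writer",
      "bus_subscriber_type": "GO_METRICS_BUS_WRITER"
    }
  ]
}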

Pub/Sub vs. Kafka

When using Pub/Sub, topics and subscriptions are two separate entities, whereas for Kafka the same topic name is used for both publishing and subscribing. Additionally, Kafka requires an authentication credential, whereas Pub/Sub authenticates with its Google identity.
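
For instance, in a Kafka deployment the broker and SASL settings appear as top-level fields, and the topic and subscription IDs for a given bus share the same Kafka topic name. The following is an illustrative fragment with placeholder values, mirroring the example configuration at the bottom of this page:

{
  "common_persistence_specs": {
    "bus_backend": "KAFKA",
    "metrics_bus_topic_id": "metrics-bus-1",
    "metrics_bus_subscription_id": "metrics-bus-1"
  },
  "kafka_bootstrap_servers": "<broker1>:<port>,<broker2>:<port>",
  "kafka_security_protocol": "SASL_SSL",
  "kafka_sasl_mechanism": "SCRAM-SHA-512",
  "kafka_sasl_secret": "<kafka auth secret stored in AWS Secrets Manager>"
}

With Pub/Sub, bus_backend is "PUBSUB", the kafka_* fields are omitted, and each topic and subscription ID refers to a distinct Pub/Sub resource.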

Configuration Options (Common)

In the JSON format, these fields are in the common_persistence_specs field, as shown in the example configuration below.

Attributes
bus_backend (string)
The backend to use for the bus: "KAFKA" for AWS, "PUBSUB" for GCP.
namespace (string)
The namespace to deploy the background persistence writers in.
service_account_name (string)
The service account to use for the background persistence writers.
secret_client (string)
The client to use for secrets: "AWS" for AWS Secrets Manager, "GCP" for GCP Secret Manager.
kafka_dlq_topic (string)
The topic to use for the Kafka dead-letter queue.
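
A minimal sketch of these common fields, using placeholder values consistent with the example configuration below:

{
  "common_persistence_specs": {
    "bus_backend": "KAFKA",
    "namespace": "background-persistence",
    "service_account_name": "background-persistence-sa",
    "secret_client": "AWS",
    "kafka_dlq_topic": "dlq-1"
  }
}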

Configuration Options (Other)

Attributes
api_server_host (string)
The hostname of the API server.
kafka_sasl_secret (string)
The cloud secret to use for Kafka authentication.
kafka_bootstrap_servers (string)
The Kafka bootstrap servers to use, as a comma-separated list.
kafka_security_protocol (string)
The Kafka security protocol to use.
kafka_sasl_mechanism (string)
The Kafka SASL mechanism to use.
redis_is_clustered (string)
Whether Redis is clustered, if using a Redis online store.
snowflake_storage_integration_name (string)
The name of the Snowflake storage integration.
metadata_provider (string)
The metadata provider to use ("GRPC_SERVER").

Configuration Options (Writers)

Attributes
name (string)
The name of the writer.
bus_subscriber_type (string)
The type of bus subscriber to use.
request (object)
The resource requests for the writer.
limit (object)
The resource limits for the writer.
version (string)
The version of the writer.
default_replica_count (int)
The default number of replicas to create for the writer.
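
Putting these together, a single entry in the writers list looks like the following sketch (values are placeholders; the full example configuration below shows two such entries):

{
  "name": "go-metrics-bus-writer",
  "bus_subscriber_type": "GO_METRICS_BUS_WRITER",
  "request": {
    "cpu": "200m",
    "memory": "512Mi"
  },
  "limit": {
    "cpu": "1",
    "memory": "512Mi"
  },
  "version": "1.0",
  "default_replica_count": 1
}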

Configuration Options (Shared Writer Fields)

In the JSON format, these fields are in the common_persistence_specs field, but they are not all necessarily required. Each writer requires an image and some, but not all, of the subscription and topic IDs.

Each writer's specification form asks for its required fields and images.

Attributes
bus_writer_image_go (string)
The Docker image to use for Go bus writers.
bus_writer_image_python (string)
The Docker image to use for Python bus writers.
bus_writer_image_bswl (string)
The Docker image to use for BigQuery streaming bus writers.
bigquery_parquet_upload_subscription_id (string)
The subscription ID used to read parquet files into the offline store.
bigquery_streaming_write_subscription_id (string)
The subscription ID to use for streaming writes to the offline store.
bigquery_streaming_write_topic (string)
The topic to use for streaming writes to the offline store.
bigquery_upload_bucket (string)
The S3 bucket to use for uploading files to the offline store.
bigquery_upload_topic (string)
The topic to use for uploading files to the offline store.
metrics_bus_subscription_id (string)
The subscription ID to use for the metrics bus.
metrics_bus_topic_id (string)
The topic ID to use for the metrics bus.
result_bus_metrics_subscription_id (string)
The subscription ID to use for result bus metrics.
result_bus_offline_store_subscription_id (string)
The subscription ID to use for the offline store result bus.
result_bus_online_store_subscription_id (string)
The subscription ID to use for the online store result bus.

Example Configuration

The following is an example configuration for background persistence writers:

{
  "common_persistence_specs": {
    "bus_backend": "KAFKA",
    "bus_writer_image_go": "<go bus writer image>",
    "bus_writer_image_python": "<python bus writer image>",
    "bus_writer_image_bswl": "<bswl bus writer image>",
    "namespace": "background-persistence",
    "service_account_name": "background-persistence-sa",
    "secret_client": "AWS",
    "bigquery_parquet_upload_subscription_id": "offline-store-bulk-insert-bus-1",
    "bigquery_streaming_write_subscription_id": "offline-store-streaming-insert-bus-1",
    "bigquery_streaming_write_topic": "offline-store-streaming-insert-bus-1",
    "bigquery_upload_bucket": "s3://<your data bucket>",
    "bigquery_upload_topic": "offline-store-bulk-insert-bus-1",
    "metrics_bus_subscription_id": "metrics-bus-1",
    "metrics_bus_topic_id": "metrics-bus-1",
    "result_bus_metrics_subscription_id": "result-bus-1",
    "result_bus_offline_store_subscription_id": "result-bus-1",
    "result_bus_online_store_subscription_id": "result-bus-1",
    "kafka_dlq_topic": "dlq-1",
    "operation_subscription_id": "operation-bus-1"
  },
  "api_server_host": "<your api server here>",
  "kafka_sasl_secret": "<your aws kafka auth secret here>",
  "kafka_bootstrap_servers": "<bootstrap server1>:<port>, <bootstrap server2>:<port>, ...",
  "kafka_security_protocol": "SASL_SSL",
  "kafka_sasl_mechanism": "SCRAM-SHA-512",
  "redis_is_clustered": "1",
  "snowflake_storage_integration_name": "<snowflak integration name>",
  "metadata_provider": "GRPC_SERVER",
  "writers": [
    {
      "name": "go-metrics-bus-writer",
      "bus_subscriber_type": "GO_METRICS_BUS_WRITER",
      "request": {
        "cpu": "200m",
        "memory": "512Mi"
      },
      "limit": {
        "cpu": "1",
        "memory": "512Mi"
      },
      "version": "1.0",
      "default_replica_count": 1
    },
    {
      "name": "go-result-bus-metrics-writer",
      "bus_subscriber_type": "GO_RESULT_BUS_METRICS_WRITER",
      "request": {
        "cpu": "400m",
        "memory": "1024Mi"
      },
      "limit": {
        "cpu": "1",
        "memory": "1024Mi"
      },
      "version": "1.0",
      "default_replica_count": 1
    }
  ]
}