Collect CPU profiles and system traces from engine nodes and upload them to cloud storage for performance analysis.
Chalk supports two profiling tools to help diagnose performance bottlenecks in your deployment:

- perf — samples CPU activity for Chalk-related processes on each node and uploads gzip-compressed perf script output to your bucket.
- Perfetto — captures system-wide traces on each node, which can be correlated with traces emitted by the Chalk Engine.

Both are configured as observability daemons in your background persistence configuration. Once enabled, a daemon runs on each Kubernetes node and periodically uploads results to your bucket.
Perfetto offers a superset of perf's features, at the cost of some performance overhead. With Perfetto, you can collect not only CPU profile information from the nodes that Chalk runs on, but also correlate it with traces emitted by the Chalk Engine.
We recommend starting with perf profiling only, and moving to Perfetto if perf does not provide enough information.
For Perfetto to work, in addition to configuring the profiling in the background persistence (as outlined in this guide), you also need
to enable trace outputs from your engine by setting the environment variable LIBCHALK_ENABLE_PERFETTO_TRACING=true.
GCP: The service account used by background persistence
(configured via service_account_name in common_persistence_specs) needs write
access to the target GCS bucket. You also need google_cloud_project set
in common_persistence_specs.
AWS: Background persistence pods obtain AWS credentials through IAM Roles for Service Accounts (IRSA). The default IAM role is provisioned with broad S3 access within your account, so any same-account bucket will work without additional configuration. To write to a bucket in a different AWS account, add a bucket policy on the destination bucket granting access to the background persistence IAM role ARN.
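For the cross-account case, the bucket policy on the destination bucket might look like the following sketch. The account ID, role name, and Sid are placeholders — substitute the actual background persistence IAM role ARN from your deployment, and note that s3:PutObject is the minimum needed for uploads:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowChalkBackgroundPersistence",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/chalk-background-persistence"
      },
      "Action": ["s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```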
The perf collector samples CPU activity for Chalk-related processes on each node and
periodically uploads gzip-compressed perf script output to your bucket.
There are two ways to enable perf profiling—through the Chalk dashboard if you manage your background persistence deployments there, or using the Chalk CLI.
In the dashboard, navigate to Settings > Shared Resources > Background Persistence, click Edit JSON, and add an observability_daemons array at the top level of the JSON, alongside the existing common_persistence_specs and writers:

"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perf_collector": {
      "perf_polling_frequency_hz": 99,
      "call_graph": true,
      "max_dumps_retained": 3,
      "dump_duration_seconds": 120,
      "bucket_subdirectory": "perf-data",
      "export_to": "s3://your-bucket-name"
    }
  }
]

Replace s3://your-bucket-name with your bucket URI. Use gs:// for GCS. The subdirectory name can be anything you like.

With the Chalk CLI, you can export your current configuration, edit it, and re-apply it to your background persistence deployment.
chalk infra describe persistence --json > persistence.json

Open persistence.json and add an observability_daemons array at the top level,
alongside the existing common_persistence_specs and writers:
"observability_daemons": [
{
"keep_running_when_suspended": true,
"perf_collector": {
"perf_polling_frequency_hz": 99,
"call_graph": true,
"max_dumps_retained": 3,
"dump_duration_seconds": 120,
"bucket_subdirectory": "perf-data",
"export_to": "s3://your-bucket-name"
}
}
]Replace s3://your-bucket-name with your bucket URI. Use gs:// for GCS.
Then apply:
chalk infra apply persistence -f persistence.json

The CLI will show a diff of the changes and prompt for confirmation.
- perf_polling_frequency_hz (integer) — The sampling frequency, in Hz. 99 is a standard default that avoids aliasing with timer interrupts.
- call_graph (boolean) — Whether to capture call graphs. Defaults to true.
- max_dumps_retained (integer) — The number of dump files to retain per node.
- dump_duration_seconds (integer) — How often perf record rotates its output file and the cleanup loop runs. Defaults to 60.
- export_to (string) — The destination bucket URI: s3://bucket-name for AWS, or gs://bucket-name for GCS.
- bucket_subdirectory (string) — The subdirectory to write to within the bucket; files are uploaded under bucket_subdirectory/node-name/.

Profiles appear in your bucket organized by node name:
BUCKET_SUBDIRECTORY/NODE_NAME/TIMESTAMP-perf-data.gz
For example, with bucket_subdirectory set to "perf-data" and a node named
ip-10-0-1-42.ec2.internal:
s3://your-bucket-name/perf-data/ip-10-0-1-42.ec2.internal/20260220T153012-perf-data.gz
Each file contains gzip-compressed perf script output, filtered to Chalk-related processes.
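Before sending dumps to support, it can be useful to peek inside one and confirm it captured the processes you expect. A minimal illustrative sketch in Python (the header heuristic is an assumption — perf script output varies with the flags used, but each sample typically begins with an unindented header line whose first token is the process name):

```python
import gzip
from collections import Counter

def samples_per_process(path: str) -> Counter:
    """Tally perf script sample headers by process name.

    In typical `perf script` output, each sample starts with an
    unindented header line (first token: process name); the indented
    lines that follow are the call stack frames.
    """
    counts = Counter()
    with gzip.open(path, "rt") as f:
        for line in f:
            if line.strip() and not line.startswith((" ", "\t")):
                counts[line.split()[0]] += 1
    return counts
```

For example, `samples_per_process("20260220T153012-perf-data.gz")` returns a Counter mapping each process name to its sample count.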
Perf profiling adds CPU and I/O overhead to your nodes. You can adjust the following settings to find the right balance:
- Lower perf_polling_frequency_hz (e.g., from 99 to 49) to reduce sampling overhead.
- Increase dump_duration_seconds, which produces fewer, larger files and reduces I/O frequency.

The Perfetto daemon captures system-wide traces on each node. Traces can be triggered on a fixed time interval or on demand via an HTTP endpoint that the Chalk CLI can call. Trace files are in Perfetto's native proto format and can be opened directly in the Perfetto UI.
There are two ways to enable Perfetto tracing—through the Chalk dashboard if you manage your background persistence deployments there, or using the Chalk CLI.
In the dashboard, navigate to Settings > Shared Resources > Background Persistence, click Edit JSON, and add an observability_daemons array at the top level of the JSON, alongside the existing common_persistence_specs and writers:

"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perfetto_daemon": {
      "trigger": "PERFETTO_TRIGGER_TIME_INTERVAL",
      "interval": 60000,
      "max_retained_runs": 3,
      "bucket_subdirectory": "perfetto-traces",
      "export_to": "s3://your-bucket-name",
      "trigger_name": "chalk_traces",
      "config_text": "buffers: {\n size_kb: 102400\n fill_policy: RING_BUFFER\n}\n\ndata_sources: {\n config {\n name: \"linux.perf\"\n perf_event_config {\n all_cpus: true\n sampling_frequency: 100\n }\n }\n}\n\ntrigger_config {\n trigger_mode: CLONE_SNAPSHOT\n triggers {\n name: \"chalk_traces\"\n stop_delay_ms: 1000\n }\n}\n"
    }
  }
]

Replace s3://your-bucket-name with your bucket URI. Use gs:// for GCS. Replace the config_text value with a valid Perfetto text proto config.

With the Chalk CLI, you can export your current configuration, edit it, and re-apply it to your background persistence deployment.
chalk infra describe persistence --json > persistence.json

Open persistence.json and add an observability_daemons array at the top level,
alongside the existing common_persistence_specs and writers:
"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perfetto_daemon": {
      "trigger": "PERFETTO_TRIGGER_TIME_INTERVAL",
      "interval": 60000,
      "max_retained_runs": 3,
      "bucket_subdirectory": "perfetto-traces",
      "export_to": "s3://your-bucket-name",
      "trigger_name": "chalk_traces",
      "config_text": "buffers: {\n size_kb: 102400\n fill_policy: RING_BUFFER\n}\n\ndata_sources: {\n config {\n name: \"linux.perf\"\n perf_event_config {\n all_cpus: true\n sampling_frequency: 100\n }\n }\n}\n\ntrigger_config {\n trigger_mode: CLONE_SNAPSHOT\n triggers {\n name: \"chalk_traces\"\n stop_delay_ms: 1000\n }\n}\n"
    }
  }
]

Replace s3://your-bucket-name with your bucket URI. Use gs:// for GCS.
Then apply:
chalk infra apply persistence -f persistence.json

The CLI will show a diff of the changes and prompt for confirmation.
The config_text field must contain a valid Perfetto text proto config.
Regardless of which trigger mode you use, the config must include a trigger_config block with
trigger_mode: CLONE_SNAPSHOT and a trigger whose name exactly matches the trigger_name field
in your daemon config. This is how Perfetto knows when to snapshot the ring buffer and emit a trace.
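Because a trigger_name that does not match any trigger in config_text means snapshots are never emitted, it can be worth sanity-checking the two fields before applying the config. The following helper is an illustrative sketch (not part of the Chalk tooling) that does a lightweight regex scan rather than fully parsing the text proto:

```python
import re

def trigger_names(config_text: str) -> set:
    """Extract trigger names declared inside the trigger_config block.

    Regex-based scan: capture everything after `trigger_config {` and
    collect all `name: "..."` entries found there. Names declared in
    data_sources (e.g. "linux.perf") appear earlier and are ignored.
    """
    block = re.search(r"trigger_config\s*{(.*)", config_text, re.S)
    if not block:
        return set()
    return set(re.findall(r'name:\s*"([^"]+)"', block.group(1)))

def check_daemon(daemon: dict) -> bool:
    """True if the daemon's trigger_name appears in its config_text."""
    return daemon.get("trigger_name") in trigger_names(daemon.get("config_text", ""))
```

Run check_daemon over each perfetto_daemon object in your persistence.json before applying it.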
Write your config in a .pbtxt file. For example, a config that samples CPU at 99 Hz and snapshots
on the trigger "chalk_traces" would look like:
buffers: {
  size_kb: 102400
  fill_policy: RING_BUFFER
}

data_sources: {
  config {
    name: "linux.perf"
    perf_event_config {
      all_cpus: true
      sampling_frequency: 99
    }
  }
}

trigger_config {
  trigger_mode: CLONE_SNAPSHOT
  triggers {
    name: "chalk_traces"
    stop_delay_ms: 1000
  }
}
Because config_text is embedded as a JSON string, newlines and quotes must be escaped.
Use jq to produce the correctly escaped value from your .pbtxt file:
jq -Rs '.' < perfetto.pbtxt

This prints the file contents as a quoted, escaped JSON string. Copy the output (including
the surrounding quotes) and use it as the config_text value in your persistence config.
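If jq is not available, any JSON encoder produces the same escaping. For example, a small Python equivalent (the filename perfetto.pbtxt is the file from the previous step):

```python
import json

def to_json_string(pbtxt: str) -> str:
    """Encode raw pbtxt text as a quoted, escaped JSON string literal,
    with newlines and quotes escaped exactly as `jq -Rs '.'` would."""
    return json.dumps(pbtxt)

# Usage:
# with open("perfetto.pbtxt") as f:
#     print(to_json_string(f.read()))
```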
The Perfetto daemon supports two ways to initiate a trace capture:
- PERFETTO_TRIGGER_TIME_INTERVAL — Traces are collected automatically on a fixed interval (controlled by the interval field). Use this for continuous background profiling.
- PERFETTO_TRIGGER_HTTP — An HTTP endpoint is exposed on port 3565. The cluster manager can call this endpoint to trigger a trace on demand. Use this when you want to capture a trace at a specific moment, such as during a known slow request. At most one HTTP-triggered Perfetto daemon may be configured per environment.

When PERFETTO_TRIGGER_HTTP is used, the cluster manager is automatically configured with the
CHALK_PERFETTO_DAEMON_PORT and CHALK_PERFETTO_DAEMON_NAMESPACE environment variables.
You can then trigger a snapshot with:

chalk profiling perfetto-snapshot

Regardless of the trigger mode you use for the daemon, the underlying Perfetto config must use trigger_mode: CLONE_SNAPSHOT for the system to work properly.
To enable on-demand tracing via chalk profiling perfetto-snapshot, use
PERFETTO_TRIGGER_HTTP as the trigger mode and set a trigger_name:
"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perfetto_daemon": {
      "trigger": "PERFETTO_TRIGGER_HTTP",
      "trigger_name": "chalk_snapshot",
      "max_retained_runs": 5,
      "bucket_subdirectory": "perfetto-traces",
      "export_to": "gs://your-bucket-name",
      "config_text": "buffers: {\n size_kb: 102400\n fill_policy: RING_BUFFER\n}\n\ndata_sources: {\n config {\n name: \"linux.perf\"\n perf_event_config {\n all_cpus: true\n sampling_frequency: 100\n }\n }\n}\n\ntrigger_config {\n trigger_mode: CLONE_SNAPSHOT\n triggers {\n name: \"chalk_snapshot\"\n stop_delay_ms: 1000\n }\n}\n"
    }
  }
]

Once deployed, trigger a trace capture with:
chalk profiling perfetto-snapshot

- config_text (string) — A Perfetto config in text proto format (the contents of a .pbtxt). This is required. See the Perfetto config documentation for available data sources and options.
- trigger (string) — PERFETTO_TRIGGER_TIME_INTERVAL for automatic periodic collection, or PERFETTO_TRIGGER_HTTP for on-demand collection via chalk profiling perfetto-snapshot.
- interval (integer) — The interval between automatic trace captures when using PERFETTO_TRIGGER_TIME_INTERVAL.
- trigger_name (string) — Must exactly match a trigger name in the trigger_config block of config_text.
- max_retained_runs (integer) — The number of trace runs to retain per node.
- export_to (string) — The destination bucket URI: s3://bucket-name for AWS, or gs://bucket-name for GCS.
- bucket_subdirectory (string) — The subdirectory to write to within the bucket; traces are uploaded under bucket_subdirectory/node-name/.

Traces appear in your bucket organized by node name:
BUCKET_SUBDIRECTORY/NODE_NAME/TIMESTAMP-perfetto-trace.pb
For example, with bucket_subdirectory set to "perfetto-traces" and a node named
ip-10-0-1-42.ec2.internal:
s3://your-bucket-name/perfetto-traces/ip-10-0-1-42.ec2.internal/20260220T153012-perfetto-trace.pb
The following fields apply to both the perf_collector and perfetto_daemon daemon objects.
- keep_running_when_suspended (boolean) — Whether to keep the daemon running while the deployment is suspended.
- request (object) — Kubernetes resource requests (cpu, memory). Defaults to 25m CPU and 64Mi memory.
- limit (object) — Kubernetes resource limits (cpu, memory).
- image_override (string) — An override for the daemon's container image.

Once data has been collected, download it from your bucket, compress it into an archive, and send it to your Chalk support contact.
For AWS:
aws s3 sync s3://your-bucket-name/perf-data/ ./perf-data/
tar czf perf-profiles.tar.gz perf-data/

For GCS:

gcloud storage rsync -r gs://your-bucket-name/perf-data/ ./perf-data/
tar czf perf-profiles.tar.gz perf-data/

For AWS:

aws s3 sync s3://your-bucket-name/perfetto-traces/ ./perfetto-traces/
tar czf perfetto-traces.tar.gz perfetto-traces/

For GCS:

gcloud storage rsync -r gs://your-bucket-name/perfetto-traces/ ./perfetto-traces/
tar czf perfetto-traces.tar.gz perfetto-traces/

When profiling is no longer needed, remove the observability_daemons entry from
your background persistence configuration and re-apply. The profiling daemonset will
be removed automatically.
In the dashboard, navigate to Settings > Shared Resources > Background Persistence,
click Edit JSON, delete the observability_daemons block, and save.
With the CLI:
chalk infra describe persistence --json > persistence.json
# Remove the observability_daemons array from persistence.json
chalk infra apply persistence -f persistence.json