# Performance Profiling
source: https://docs.chalk.ai/docs/profiling

Collect CPU profiles and system traces from engine nodes and upload them to cloud storage for performance analysis.

Chalk supports two profiling tools to help diagnose performance bottlenecks in your deployment:

- Perf profiling — samples CPU activity using Linux perf and uploads gzip-compressed perf script output to your bucket.
- Perfetto tracing — captures system-wide traces using Perfetto and uploads them in Perfetto's native proto format.

Both are configured as observability daemons in your
background persistence configuration. Once enabled,
a daemon runs on each Kubernetes node and periodically uploads results to your bucket.

### Choosing which tool to use

Perfetto offers a superset of perf's features, at the cost of some additional overhead. With Perfetto, you can collect
CPU profile information about the nodes that Chalk runs on and also correlate it with traces emitted by the Chalk Engine.

Start with perf profiling, and move to Perfetto only if perf does not provide enough information.

For Perfetto to work, in addition to configuring the profiling in the background persistence (as is outlined in this guide), you also need
to:

- Deploy a version of the engine with Perfetto traces compiled in (any O2 image will do)
- Enable trace outputs from your engine by setting the environment variable LIBCHALK_ENABLE_PERFETTO_TRACING=true

### Prerequisites

- A running Chalk deployment with background persistence configured. To install it, follow the background persistence installation guide.
- An S3 or GCS bucket for storing profiles and traces
- Write access from the background persistence service account to that bucket

### Cloud storage permissions

GCP: The service account used by background persistence
(configured via service_account_name in common_persistence_specs) needs write
access to the target GCS bucket. You also need google_cloud_project set
in common_persistence_specs.

AWS: Background persistence pods obtain AWS credentials through
IAM Roles for Service Accounts (IRSA).
The default IAM role is provisioned with broad S3 access within your account,
so any same-account bucket will work without additional configuration.
To write to a bucket in a different AWS account, add a bucket policy on the
destination bucket granting access to the background persistence IAM role ARN.
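
For the cross-account case, the bucket policy can be generated rather than written by hand. The following is a minimal Python sketch, not Chalk's official policy: the function name, the Sid, the placeholder role ARN, and the choice of s3:PutObject as the only granted action are all assumptions, so adjust them to match your environment.

```python
import json


def cross_account_bucket_policy(bucket: str, role_arn: str) -> dict:
    """Build an S3 bucket policy that lets one IAM role write objects to a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowProfilingUploads",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                # Assumption: uploads only need s3:PutObject; widen this if required.
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


policy = cross_account_bucket_policy(
    "your-bucket-name",
    "arn:aws:iam::111122223333:role/background-persistence",  # placeholder ARN
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the destination bucket as its bucket policy.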

### Perf profiling

The perf collector samples CPU activity for Chalk-related processes on each node and
periodically uploads gzip-compressed perf script output to your bucket.

### Enabling perf profiling

There are two ways to enable perf profiling—through the Chalk dashboard if you manage your
background persistence deployments there, or using the Chalk CLI.

### Using the Chalk dashboard

- Navigate to Settings > Shared Resources > Background Persistence.
- Click Edit JSON.
- Add an observability_daemons array at the top level of the JSON,
alongside the existing common_persistence_specs and writers:

```
"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perf_collector": {
      "perf_polling_frequency_hz": 99,
      "call_graph": true,
      "max_dumps_retained": 3,
      "dump_duration_seconds": 120,
      "bucket_subdirectory": "perf-data",
      "export_to": "s3://your-bucket-name"
    }
  }
]
```

- Replace s3://your-bucket-name with your bucket URI. Use gs:// for GCS.
The subdirectory name can be anything you like.
- Save and apply.

### Using the Chalk CLI

With the Chalk CLI, you can export your current configuration, edit it, and re-apply to your background persistence deployment.

```
chalk infra describe persistence --json > persistence.json
```

Open persistence.json and add an observability_daemons array at the top level,
alongside the existing common_persistence_specs and writers:

```
"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perf_collector": {
      "perf_polling_frequency_hz": 99,
      "call_graph": true,
      "max_dumps_retained": 3,
      "dump_duration_seconds": 120,
      "bucket_subdirectory": "perf-data",
      "export_to": "s3://your-bucket-name"
    }
  }
]
```

Replace s3://your-bucket-name with your bucket URI. Use gs:// for GCS.
Then apply:

```
chalk infra apply persistence -f persistence.json
```

The CLI will show a diff of the changes and prompt for confirmation.
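
If you prefer to script the edit instead of changing persistence.json by hand, the step above can be sketched in Python. This assumes the exported file is a single JSON object with observability_daemons at the top level, as described above. The helper name add_perf_collector is hypothetical, and the field values mirror the example configuration.

```python
import json


def add_perf_collector(config: dict, export_to: str,
                       bucket_subdirectory: str = "perf-data") -> dict:
    """Return a copy of the persistence config with a perf collector daemon appended."""
    if not export_to.startswith(("s3://", "gs://")):
        raise ValueError("export_to must be an s3:// or gs:// bucket URI")
    daemon = {
        "keep_running_when_suspended": True,
        "perf_collector": {
            "perf_polling_frequency_hz": 99,
            "call_graph": True,
            "max_dumps_retained": 3,
            "dump_duration_seconds": 120,
            "bucket_subdirectory": bucket_subdirectory,
            "export_to": export_to,
        },
    }
    updated = dict(config)  # shallow copy; existing top-level keys are kept as-is
    existing = list(config.get("observability_daemons", []))
    updated["observability_daemons"] = existing + [daemon]
    return updated


# Example with an in-memory config; in practice, json.load() persistence.json first.
example = add_perf_collector(
    {"common_persistence_specs": {}, "writers": []}, "s3://your-bucket-name"
)
print(json.dumps(example, indent=2))
```

Write the result back to persistence.json and run the apply command as usual; the CLI diff remains the final check on what will change.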

### Configuration reference

- perf_polling_frequency_hz: Sampling frequency in Hz. 99 is a standard default that avoids aliasing with timer interrupts.
- call_graph: Capture full call stacks with each sample. Required for flame graph generation. Defaults to true.
- max_dumps_retained: Maximum number of profile files to keep on disk per node. Older files are uploaded and deleted.
- dump_duration_seconds: How often, in seconds, perf record rotates its output file and the cleanup loop runs. Defaults to 60.
- export_to: Bucket URI for uploads. Use s3://bucket-name for AWS, or gs://bucket-name for GCS.
- bucket_subdirectory: Path prefix within the bucket. Files are organized as BUCKET_SUBDIRECTORY/NODE_NAME/TIMESTAMP-perf-data.gz.

### Where to find profiles

Profiles appear in your bucket organized by node name:

```
BUCKET_SUBDIRECTORY/NODE_NAME/TIMESTAMP-perf-data.gz
```

For example, with bucket_subdirectory set to "perf-data" and a node named
ip-10-0-1-42.ec2.internal:

```
s3://your-bucket-name/perf-data/ip-10-0-1-42.ec2.internal/20260220T153012-perf-data.gz
```

Each file contains gzip-compressed perf script output, filtered to Chalk-related processes.
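
When you process many profiles, it helps to split each object key back into its parts. The following Python sketch parses the layout shown above. The helper and type names are hypothetical, and the timestamp format (YYYYMMDDTHHMMSS with no timezone suffix) is inferred from the example key rather than from a documented guarantee.

```python
from datetime import datetime
from typing import NamedTuple


class ProfileKey(NamedTuple):
    subdirectory: str
    node_name: str
    captured_at: datetime


def parse_profile_key(key: str) -> ProfileKey:
    """Split BUCKET_SUBDIRECTORY/NODE_NAME/TIMESTAMP-perf-data.gz into its parts."""
    # rsplit keeps any extra slashes inside the subdirectory prefix intact.
    subdirectory, node_name, filename = key.rsplit("/", 2)
    timestamp, _, suffix = filename.partition("-")
    if suffix != "perf-data.gz":
        raise ValueError(f"not a perf profile key: {key!r}")
    # Assumption: timestamps look like 20260220T153012.
    captured_at = datetime.strptime(timestamp, "%Y%m%dT%H%M%S")
    return ProfileKey(subdirectory, node_name, captured_at)


key = "perf-data/ip-10-0-1-42.ec2.internal/20260220T153012-perf-data.gz"
parsed = parse_profile_key(key)
print(parsed.node_name, parsed.captured_at.isoformat())
```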

### Tuning overhead

Perf profiling adds CPU and I/O overhead to your nodes. You can adjust
the following settings to find the right balance:

- Lower the sampling frequency: Reduce perf_polling_frequency_hz (e.g., from
99 to 49).
- Increase dump duration: A larger dump_duration_seconds produces fewer, larger
files and reduces I/O frequency.
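
To get a feel for how these two settings interact, a rough estimate is enough. The sketch below assumes each busy CPU is sampled at the configured frequency for the whole dump window, so it gives an upper bound: idle CPUs and the filter to Chalk-related processes will lower the real number. The function names and the busy_cpus parameter are illustrative only.

```python
def samples_per_dump(frequency_hz: int, dump_duration_seconds: int,
                     busy_cpus: int = 1) -> int:
    """Upper bound on samples in one dump: frequency x duration x busy CPUs."""
    return frequency_hz * dump_duration_seconds * busy_cpus


def dumps_per_hour(dump_duration_seconds: int) -> float:
    """Number of output files each node rotates through per hour."""
    return 3600 / dump_duration_seconds


# Compare the default-like setting with the two tuning options from the list above.
for hz, seconds in [(99, 60), (99, 120), (49, 120)]:
    print(f"{hz} Hz, {seconds}s dumps: "
          f"<= {samples_per_dump(hz, seconds, busy_cpus=8)} samples per dump, "
          f"{dumps_per_hour(seconds):.0f} dumps per hour")
```

Halving the frequency roughly halves the sampling work, while doubling the dump duration leaves the sampling rate unchanged and only halves how often files are rotated and uploaded.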

### Perfetto tracing

The Perfetto daemon captures system-wide traces on each node. Traces can be triggered
on a fixed time interval or on-demand via an HTTP endpoint that the Chalk CLI can call.
Trace files are in Perfetto's native proto format and can be opened directly in the
Perfetto UI.

### Enabling Perfetto tracing

There are two ways to enable Perfetto tracing—through the Chalk dashboard if you manage
your background persistence deployments there, or using the Chalk CLI.

### Using the Chalk dashboard

- Navigate to Settings > Shared Resources > Background Persistence.
- Click Edit JSON.
- Add an observability_daemons array at the top level of the JSON,
alongside the existing common_persistence_specs and writers:

```
"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perfetto_daemon": {
      "trigger": "PERFETTO_TRIGGER_TIME_INTERVAL",
      "interval": 60000,
      "max_retained_runs": 3,
      "bucket_subdirectory": "perfetto-traces",
      "export_to": "s3://your-bucket-name",
      "trigger_name": "chalk_traces",
      "config_text": "buffers: {\n  size_kb: 102400\n  fill_policy: RING_BUFFER\n}\n\ndata_sources: {\n  config {\n    name: \"linux.perf\"\n    perf_event_config {\n      all_cpus: true\n      sampling_frequency: 100\n    }\n  }\n}\n\ntrigger_config {\n  trigger_mode: CLONE_SNAPSHOT\n  triggers {\n    name: \"chalk_traces\"\n    stop_delay_ms: 1000\n  }\n}\n"
    }
  }
]
```

- Replace s3://your-bucket-name with your bucket URI. Use gs:// for GCS.
Replace the config_text value with a valid Perfetto text proto config.
- Save and apply.

### Using the Chalk CLI

With the Chalk CLI, you can export your current configuration, edit it, and re-apply to your
background persistence deployment.

```
chalk infra describe persistence --json > persistence.json
```

Open persistence.json and add an observability_daemons array at the top level,
alongside the existing common_persistence_specs and writers:

```
"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perfetto_daemon": {
      "trigger": "PERFETTO_TRIGGER_TIME_INTERVAL",
      "interval": 60000,
      "max_retained_runs": 3,
      "bucket_subdirectory": "perfetto-traces",
      "export_to": "s3://your-bucket-name",
      "trigger_name": "chalk_traces",
      "config_text": "buffers: {\n  size_kb: 102400\n  fill_policy: RING_BUFFER\n}\n\ndata_sources: {\n  config {\n    name: \"linux.perf\"\n    perf_event_config {\n      all_cpus: true\n      sampling_frequency: 100\n    }\n  }\n}\n\ntrigger_config {\n  trigger_mode: CLONE_SNAPSHOT\n  triggers {\n    name: \"chalk_traces\"\n    stop_delay_ms: 1000\n  }\n}\n"
    }
  }
]
```

Replace s3://your-bucket-name with your bucket URI. Use gs:// for GCS.
Then apply:

```
chalk infra apply persistence -f persistence.json
```

The CLI will show a diff of the changes and prompt for confirmation.

### Generating Perfetto text proto config

The config_text field must contain a valid Perfetto text proto config.

Regardless of which trigger mode you use, the config must include a trigger_config block with
trigger_mode: CLONE_SNAPSHOT and a trigger whose name exactly matches the trigger_name field
in your daemon config. This is how Perfetto knows when to snapshot the ring buffer and emit a trace.

Write your config in a .pbtxt file. For example, a config that samples CPU at 99 Hz and snapshots
on the trigger "chalk_traces" would look like:

```
buffers: {
  size_kb: 102400
  fill_policy: RING_BUFFER
}

data_sources: {
  config {
    name: "linux.perf"
    perf_event_config {
      all_cpus: true
      sampling_frequency: 99
    }
  }
}

trigger_config {
  trigger_mode: CLONE_SNAPSHOT
  triggers {
    name: "chalk_traces"
    stop_delay_ms: 1000
  }
}
```

Because config_text is embedded as a JSON string, newlines and quotes must be escaped.
Use jq to produce the correctly escaped value from your .pbtxt file:

```
jq -Rs '.' < perfetto.pbtxt
```

This prints the file contents as a quoted, escaped JSON string. Copy the output (including
the surrounding quotes) and use it as the config_text value in your persistence config.
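
If you build the persistence config programmatically, any JSON serializer performs the same escaping as the jq command. The following Python sketch shows the idea. The helper name perfetto_daemon_entry is hypothetical and the field values are copied from the example configuration, so treat it as an illustration rather than a required workflow.

```python
import json


def perfetto_daemon_entry(pbtxt: str, trigger_name: str, export_to: str) -> dict:
    """Build an observability_daemons entry holding the raw text proto config."""
    return {
        "keep_running_when_suspended": True,
        "perfetto_daemon": {
            "trigger": "PERFETTO_TRIGGER_TIME_INTERVAL",
            "interval": 60000,
            "max_retained_runs": 3,
            "bucket_subdirectory": "perfetto-traces",
            "export_to": export_to,
            "trigger_name": trigger_name,
            "config_text": pbtxt,  # raw text; json.dumps escapes it on output
        },
    }


# A shortened config for illustration; normally read this from perfetto.pbtxt.
pbtxt = (
    "trigger_config {\n"
    "  trigger_mode: CLONE_SNAPSHOT\n"
    "  triggers {\n"
    '    name: "chalk_traces"\n'
    "  }\n"
    "}\n"
)
entry = perfetto_daemon_entry(pbtxt, "chalk_traces", "s3://your-bucket-name")
encoded = json.dumps(entry, indent=2)
print(encoded)
```

In the printed output, newlines in config_text appear as \n and quotes as \", which is the form the persistence config expects.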

### Trigger modes

The Perfetto daemon supports two ways to initiate a trace capture:

- PERFETTO_TRIGGER_TIME_INTERVAL — Traces are collected automatically on a fixed interval
(controlled by the interval field). Use this for continuous background profiling.
- PERFETTO_TRIGGER_HTTP — An HTTP endpoint is exposed on port 3565. The cluster manager
can call this endpoint to trigger a trace on demand. Use this when you want to capture a trace
at a specific moment, such as during a known slow request. At most one HTTP-triggered Perfetto
daemon may be configured per environment.

When PERFETTO_TRIGGER_HTTP is used, the cluster manager is automatically configured with the
CHALK_PERFETTO_DAEMON_PORT and CHALK_PERFETTO_DAEMON_NAMESPACE environment variables.
You can then trigger a snapshot with:

```
chalk profiling perfetto-snapshot
```

Whichever trigger mode the daemon uses, the underlying Perfetto config must set trigger_mode: CLONE_SNAPSHOT.
Without it, the daemon's trigger cannot snapshot the ring buffer and no trace is emitted.
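
A mismatch between trigger_name and the trigger declared inside config_text is easy to miss, so a quick check before applying can help. The sketch below is a text-based heuristic in Python, not a real text proto parser, and the function name check_trigger_config is hypothetical. It only looks for the two requirements described above.

```python
import re


def check_trigger_config(config_text: str, trigger_name: str) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if not re.search(r"trigger_mode:\s*CLONE_SNAPSHOT\b", config_text):
        problems.append("config_text does not set trigger_mode: CLONE_SNAPSHOT")
    # Collect the name declared inside each triggers { ... } block.
    names = re.findall(r'triggers\s*\{[^}]*?name:\s*"([^"]*)"', config_text)
    if trigger_name not in names:
        problems.append(
            f"no trigger named {trigger_name!r} in config_text (found {names})"
        )
    return problems


config_text = (
    "trigger_config {\n"
    "  trigger_mode: CLONE_SNAPSHOT\n"
    "  triggers {\n"
    '    name: "chalk_traces"\n'
    "    stop_delay_ms: 1000\n"
    "  }\n"
    "}\n"
)
print(check_trigger_config(config_text, "chalk_traces"))    # matching name
print(check_trigger_config(config_text, "chalk_snapshot"))  # mismatched name
```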

### On-demand tracing (HTTP trigger)

To enable on-demand tracing via chalk profiling perfetto-snapshot, use
PERFETTO_TRIGGER_HTTP as the trigger mode and set a trigger_name:

```
"observability_daemons": [
  {
    "keep_running_when_suspended": true,
    "perfetto_daemon": {
      "trigger": "PERFETTO_TRIGGER_HTTP",
      "trigger_name": "chalk_snapshot",
      "max_retained_runs": 5,
      "bucket_subdirectory": "perfetto-traces",
      "export_to": "gs://your-bucket-name",
      "config_text": "buffers: {\n  size_kb: 102400\n  fill_policy: RING_BUFFER\n}\n\ndata_sources: {\n  config {\n    name: \"linux.perf\"\n    perf_event_config {\n      all_cpus: true\n      sampling_frequency: 100\n    }\n  }\n}\n\ntrigger_config {\n  trigger_mode: CLONE_SNAPSHOT\n  triggers {\n    name: \"chalk_snapshot\"\n    stop_delay_ms: 1000\n  }\n}\n"
    }
  }
]
```

Once deployed, trigger a trace capture with:

```
chalk profiling perfetto-snapshot
```

### Configuration reference

- config_text: Perfetto tracing configuration in text proto format (.pbtxt). Required. See the Perfetto config documentation for available data sources and options.
- trigger: How traces are initiated. Use PERFETTO_TRIGGER_TIME_INTERVAL for automatic periodic collection, or PERFETTO_TRIGGER_HTTP for on-demand collection via chalk profiling perfetto-snapshot.
- interval: Interval between traces in milliseconds. This is also how often the daemon scans for new trace files to upload.
- trigger_name: Perfetto trigger name. Required. Must exactly match the trigger name in the trigger_config block of config_text.
- max_retained_runs: Maximum number of trace files to keep on disk per node. Older files are uploaded and deleted. We recommend setting this to 0 so that uploads start immediately.
- export_to: Bucket URI for uploads. Use s3://bucket-name for AWS, or gs://bucket-name for GCS.
- bucket_subdirectory: Path prefix within the bucket. Files are organized as BUCKET_SUBDIRECTORY/NODE_NAME/TIMESTAMP-perfetto-trace.pb.

### Where to find traces

Traces appear in your bucket organized by node name:

```
BUCKET_SUBDIRECTORY/NODE_NAME/TIMESTAMP-perfetto-trace.pb
```

For example, with bucket_subdirectory set to "perfetto-traces" and a node named
ip-10-0-1-42.ec2.internal:

```
s3://your-bucket-name/perfetto-traces/ip-10-0-1-42.ec2.internal/20260220T153012-perfetto-trace.pb
```

### Common configuration

The following fields apply to both the perf_collector and perfetto_daemon daemon objects.

keep_running_when_suspended: Keep the daemon running when background persistence is suspended.

Kubernetes resource requests (cpu, memory). Defaults to 25m CPU and 64Mi memory.

Kubernetes resource limits (cpu, memory).

Custom container image. Omit to use the default.

### Sending data to Chalk

Once data has been collected, download it from your bucket, compress it into an archive,
and send it to your Chalk support contact.

### Perf profiles

For AWS:

```
aws s3 sync s3://your-bucket-name/perf-data/ ./perf-data/
tar czf perf-profiles.tar.gz perf-data/
```

For GCS:

```
gcloud storage rsync -r gs://your-bucket-name/perf-data/ ./perf-data/
tar czf perf-profiles.tar.gz perf-data/
```

### Perfetto traces

For AWS:

```
aws s3 sync s3://your-bucket-name/perfetto-traces/ ./perfetto-traces/
tar czf perfetto-traces.tar.gz perfetto-traces/
```

For GCS:

```
gcloud storage rsync -r gs://your-bucket-name/perfetto-traces/ ./perfetto-traces/
tar czf perfetto-traces.tar.gz perfetto-traces/
```

### Disabling profiling

When profiling is no longer needed, remove the observability_daemons entry from
your background persistence configuration and re-apply. The profiling DaemonSet is
then removed automatically.

In the dashboard, navigate to Settings > Shared Resources > Background Persistence,
click Edit JSON, delete the observability_daemons block, and save.

With the CLI:

```
chalk infra describe persistence --json > persistence.json
# Remove the observability_daemons array from persistence.json
chalk infra apply persistence -f persistence.json
```
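
The manual removal step between the two commands can also be scripted. This is a minimal Python sketch, assuming persistence.json is a single JSON object with observability_daemons at the top level; both function names are hypothetical.

```python
import json
from pathlib import Path


def remove_observability_daemons(config: dict) -> dict:
    """Return a copy of the persistence config without the observability_daemons key."""
    return {key: value for key, value in config.items()
            if key != "observability_daemons"}


def strip_profiling(path: str = "persistence.json") -> None:
    """Rewrite the exported config file in place with profiling daemons removed."""
    file = Path(path)
    config = json.loads(file.read_text())
    file.write_text(json.dumps(remove_observability_daemons(config), indent=2) + "\n")
```

Call strip_profiling() after the describe command and before the apply command. The CLI diff should then show only the observability_daemons block being removed.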