Telemetry FAQ

Telemetry is the observability pipeline that powers the logs and traces you see in the Chalk dashboard: the structured-logs view, the query-logs view, and the Trace tab on the Online Query page. It runs entirely inside your Kubernetes cluster, and almost all of it is configurable under Settings > Shared Resources > Telemetry. This page answers the most common questions about why telemetry might not behave as expected, and how to resolve them yourself.

Telemetry covers logs and traces, not metrics. Metrics flow through a separate pipeline (the Metrics Database / TimescaleDB) configured under Settings > Shared Resources > Metrics Database. If a metric or chart is missing, see Metrics Monitor and Metrics Export instead.

How telemetry works

Logs and traces flow through four stages, all running in your cluster in the chalk-telemetry namespace:

Collector (OpenTelemetry or Vector) → Aggregator → ClickHouse → Dashboard

The collector receives logs and traces emitted by the engine and resolvers. These pods are named chalk-collector-*.
The aggregator batches and forwards them. This pod is named chalk-telemetry-aggregator-*.
ClickHouse is the database that stores the logs and traces. This is the clickhouse-0 pod. Traces have a default TTL of 7 days, which Chalk shortens automatically as storage fills.
The dashboard reads from ClickHouse to render the logs views and the Trace tab.

Because the whole pipeline is just a set of pods, most telemetry problems are infrastructure problems (a pod that can’t schedule, or one that’s out of memory) rather than bugs in your features or resolvers. The two places you’ll work from are:

Settings > Shared Resources > Telemetry: view and edit the deployment (CPU/memory for ClickHouse, the collector, and the aggregator; ClickHouse storage size; collector runtime; and more). The guided form and a raw JSON editor are both available here.
Infrastructure > Kubernetes: inspect pod health. On the All Pods tab, set the namespace filter to chalk-telemetry to see only the telemetry pods and their status, restarts, and age. Click a pod to view its logs.
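
The same pod check can also be scripted. The following is a minimal Python sketch, assuming kubectl is installed and configured for the cluster that hosts Chalk. The helper names summarize_pods and fetch_telemetry_pods are illustrative and are not part of any Chalk tooling. It reduces the output of kubectl get pods -n chalk-telemetry -o json to each pod's phase, restart count, and any OOMKilled or CrashLoopBackOff reasons.

```python
import json
import subprocess

# Container reasons this FAQ treats as signs of an unhealthy telemetry pod.
UNHEALTHY_REASONS = {"OOMKilled", "CrashLoopBackOff"}


def summarize_pods(pod_list: dict) -> list[dict]:
    """Reduce `kubectl get pods -o json` output to name, phase, restarts, and problems."""
    summary = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        reasons = set()
        restarts = 0
        for cs in status.get("containerStatuses", []):
            restarts += cs.get("restartCount", 0)
            # Inspect the current state (e.g. waiting: CrashLoopBackOff) and the
            # last termination (e.g. terminated: OOMKilled).
            for state in (cs.get("state", {}), cs.get("lastState", {})):
                for detail in state.values():
                    reason = detail.get("reason")
                    if reason in UNHEALTHY_REASONS:
                        reasons.add(reason)
        summary.append({
            "name": pod["metadata"]["name"],
            "phase": status.get("phase", "Unknown"),
            "restarts": restarts,
            "problems": sorted(reasons),
        })
    return summary


def fetch_telemetry_pods(namespace: str = "chalk-telemetry") -> dict:
    """Shell out to kubectl; assumes it is installed and pointed at the right cluster."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling summarize_pods(fetch_telemetry_pods()) returns one entry per pod. Any pod whose phase is not Running, or whose problems list is non-empty, deserves a closer look on the All Pods tab.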

After any change on the Telemetry settings page, click Save & Apply and then watch the change through to completion: confirm that the affected pods actually reschedule and become healthy. Reconfiguring an infrastructure component and walking away is the most common way a small change turns into an outage.

Why are my logs or traces not showing up in the dashboard?

Logs and traces share the same pipeline, so the first checks are the same:

Did the telemetry settings change recently? A recent edit to the ClickHouse or aggregator resources on the Settings > Shared Resources > Telemetry page is the most common trigger.
Are the telemetry pods healthy? Go to Infrastructure > Kubernetes, filter the namespace to chalk-telemetry, and confirm clickhouse-0, the chalk-collector-* pods, and the chalk-telemetry-aggregator-* pod are all Running. If any are Pending, OOMKilled, or in CrashLoopBackOff, see the next question.
For traces specifically, also confirm tracing is actually enabled and sampled. See “I added --trace but no trace appears” below.

Note the 7-day trace TTL: traces older than roughly a week are expected to age out, and the window can be shorter if ClickHouse storage filled up. A missing old trace is usually expected behavior, not a failure.
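
As a quick sanity check before investigating a missing trace, the TTL arithmetic can be written down explicitly. This is a small illustrative sketch, not Chalk code. It uses only the 7-day default stated above, and the function name is hypothetical. Because the effective window can be shorter when storage fills, a True result means "not yet aged out by the default TTL" and does not guarantee that the trace is still stored.

```python
from datetime import datetime, timedelta, timezone

# Default from this page; the effective window can be shorter if storage filled up.
DEFAULT_TRACE_TTL = timedelta(days=7)


def within_default_ttl(trace_time, now=None, ttl=DEFAULT_TRACE_TTL):
    """Return True if a trace recorded at trace_time is younger than the TTL."""
    now = now or datetime.now(timezone.utc)
    return now - trace_time < ttl
```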

What does "ClickHouse server is not currently reachable" mean?

You may see:

“Clickhouse server is not currently reachable. It is likely that the instance is under maintenance or the pod has been moved to a new node. Please try again in a few minutes…”

This is the most frequent telemetry error. It appears when the logs and traces views try to query ClickHouse and can’t reach it.

It only affects your ability to view telemetry in the dashboard. Online and offline query serving and feature computation are completely unaffected, and your application keeps working normally.

Causes, in order of likelihood:

Transient unavailability. ClickHouse was briefly down for maintenance, or its pod was just rescheduled onto a new node. Wait a few minutes and refresh. If it clears, no action is needed.
The ClickHouse (or aggregator) pod can’t schedule. Check the clickhouse-0 pod in the chalk-telemetry namespace (Infrastructure > Kubernetes, namespace filter chalk-telemetry). If it’s Pending with FailedScheduling events, its resource request is too large for the node it targets. The classic mistake is setting a memory request equal to the node’s total memory (for example, a 32Gi request on a 32Gi m6a.2xlarge). DaemonSets and system overhead consume part of every node, so a request that equals node capacity can never be placed and the pod stays Pending forever. Combining a pinned instance type with a high request makes this mistake especially easy to make.
OOM / CrashLoopBackOff. The aggregator can run out of memory under heavy query volume, and ClickHouse can restart under memory pressure. Repeated OOMKilled status means memory needs to go up, but only to a value that still fits the node.

Self-serve fix: go to Settings > Shared Resources > Telemetry and adjust the memory in the ClickHouse and/or Aggregator sections.

If a pod can’t schedule: lower the request so it fits the node (leave headroom below the instance’s total memory, for example request 30Gi on a 32Gi machine rather than 32Gi), or move to a larger instance type. Never set the request equal to the machine size.
If a pod is OOMing: raise the memory request for the resource that ran out of memory (the aggregator is often the right knob), staying under node capacity.
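
The headroom rule can be checked with simple arithmetic before you save a change. The sketch below is illustrative only: the function names are hypothetical, and the 2Gi reserve is an assumed placeholder for DaemonSet and system overhead, not a Chalk or Kubernetes constant. The real figure for a node is whatever it reports as allocatable memory (for example, in the Allocatable section of kubectl describe node).

```python
import re

# Binary suffixes used by Kubernetes memory quantities.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Parse a binary-suffixed memory quantity such as '32Gi' into bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)", quantity.strip())
    if not match:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def request_fits(request: str, node_memory: str, reserved: str = "2Gi") -> bool:
    """True if the request leaves at least `reserved` free on the node.

    The 2Gi default is an assumption for illustration; substitute the
    difference between the node's capacity and its allocatable memory.
    """
    available = parse_quantity(node_memory) - parse_quantity(reserved)
    return parse_quantity(request) <= available
```

With these assumptions, request_fits("32Gi", "32Gi") is False, matching the Pending-forever case described above, while request_fits("30Gi", "32Gi") is True.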

Then Save & Apply, confirm clickhouse-0 reaches Running, and reload the dashboard. If it still can’t reach ClickHouse after about 10 minutes, contact Chalk support.

Leave Serve over HTTP disabled in the ClickHouse settings unless the Chalk team explicitly asks you to enable it, since it can break ClickHouse connectivity.

I added `--trace` but no trace appears

Work through these in order:

Check the flag syntax. It is --trace, with no space. Writing chalk query … -- trace (a space after --) is parsed as “end of flags” plus a stray positional argument, so tracing is silently not enabled and you get no error.
Check sampling. Tracing across your environment is governed by OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG (Settings > Integrations > Environment Variables). The defaults are parentbased_traceidratio at 0.01 (1%). If the sampler is set to always_off, query-time flags are ignored and nothing is traced. With a parentbased_* sampler you can still force a single query with --trace (CLI) or trace=True (ChalkClient). See Tracing for the full table of samplers.
Check the client. Tracing requires the chalkpy[tracing] extra and chalkpy >= 2.95.9.
View it in the right place. A traced query’s waterfall appears in the Trace tab of the Online Query page.
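
The flag-syntax pitfall in the first check is a general command-line convention rather than anything Chalk-specific: a bare -- tells most parsers that no more flags follow. The sketch below reproduces it with Python's argparse. The query-demo parser is a hypothetical stand-in and is not the Chalk CLI's actual parser; it only shows why the mistyped form produces no error.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for a query command; not the Chalk CLI's real parser.
    parser = argparse.ArgumentParser(prog="query-demo")
    parser.add_argument("--trace", action="store_true", help="enable tracing")
    parser.add_argument("rest", nargs="*", help="positional arguments")
    return parser


parser = build_parser()

# Correct: "--trace" is recognized as a flag.
correct = parser.parse_args(["--trace"])

# Mistyped: "--" ends flag parsing, so "trace" becomes a positional argument.
mistyped = parser.parse_args(["--", "trace"])

print("correct: ", correct.trace, correct.rest)
print("mistyped:", mistyped.trace, mistyped.rest)
```

In the mistyped case the trace flag stays off and the word trace lands in the positional arguments, so parsing succeeds silently.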

How do I use traces to find a slow query?

Once tracing works, traces are the primary tool for locating latency. Enable tracing on a representative query, open the Trace tab, and look for the longest spans. The waterfall shows where time is spent and what called what. For example, it distinguishes fast Redis reads from slow warehouse or Postgres round trips.
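
If you have span timings outside the dashboard, "look for the longest spans" amounts to sorting by duration. The sketch below assumes a hypothetical Span shape (name, start, and end in milliseconds). It does not reflect Chalk's trace schema or any export format, and the Trace tab waterfall already gives you this view interactively.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical span record for illustration; not Chalk's trace schema.
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_spans(spans, top_n=3):
    """Return the top_n spans ordered by duration, longest first."""
    return sorted(spans, key=lambda span: span.duration_ms, reverse=True)[:top_n]
```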

From there:

Target the offending data source (add indices, increase its CPU/memory, or add read replicas).
Accelerate any unaccelerated resolvers that dominate the trace. See Static Resolver Optimization.
Consider querying through the gRPC client (ChalkGRPCClient), which is generally faster than HTTP.

For the broader set of tools for debugging queries (Query Plan Visualizer, Resolver Replay, logging), see Debugging Queries.

When should I contact Chalk support?

Reach out to Chalk support if:

The “not reachable” error persists for more than about 10 minutes after you have confirmed that the telemetry pods are healthy.
Pods keep OOMing or restarting even after correctly sizing memory within node limits.
ClickHouse won’t schedule and it’s unclear which combination of request and instance type will fit.
You’re considering Serve over HTTP or other advanced fields. Confirm with the Chalk team first.

Additional resources

Tracing: how to enable tracing, sampling configuration, and viewing the Trace tab.
Observability Overview: the full observability section, including metrics and alerting.
Log Export: export logs to external monitoring systems and the log field schema.
Debugging Queries: Query Plan Visualizer, Resolver Replay, and logging.
Kubernetes Resources Overview: how Chalk uses Kubernetes resources.
