Observability
Troubleshoot missing logs and traces, and other telemetry pipeline issues.
Telemetry is the observability pipeline that powers the logs and traces you see in the Chalk
dashboard: the structured-logs view, the query-logs view, and the Trace tab on the Online Query page.
It runs entirely inside your Kubernetes cluster, and almost all of it is configurable under
Settings > Shared Resources > Telemetry. This page answers the most common questions about why telemetry
might not behave as expected, and how to resolve them yourself.
Telemetry covers logs and traces, not metrics. Metrics flow through a separate pipeline (the Metrics Database / TimescaleDB) configured under
Settings > Shared Resources > Metrics Database. If a metric or chart is missing, see Metrics Monitor and Metrics Export instead.
Logs and traces flow through four stages, all running in your cluster in the chalk-telemetry namespace:
Collector (OpenTelemetry or Vector) → Aggregator → ClickHouse → Dashboard
These stages run as the chalk-collector-* pods, the chalk-telemetry-aggregator-* pod, and the clickhouse-0 pod. Traces have
a default TTL of 7 days, which Chalk shortens automatically as storage fills.
Because the whole pipeline is just a set of pods, most telemetry problems are infrastructure problems (a pod that can’t schedule, or one that’s out of memory) rather than bugs in your features or resolvers. The two places you’ll work from are:
Settings > Shared Resources > Telemetry: view and edit the deployment (CPU/memory for ClickHouse, the
collector, and the aggregator; ClickHouse storage size; collector runtime; and more). The guided form and a
raw JSON editor are both available here.
Infrastructure > Kubernetes: inspect pod health. On the All Pods tab, set the namespace filter to
chalk-telemetry to see only the telemetry pods and their status, restarts, and age. Click a pod to view its
logs.
After any change on the Telemetry settings page, Save & Apply and then watch the change through to resolution. Confirm the affected pods actually reschedule and become healthy. Reconfiguring an infrastructure component and walking away is the most common way a small change turns into an outage.
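The "watch the change through to resolution" step can be sketched as a simple polling loop. This is a minimal illustration, not a Chalk API: `get_statuses` is a hypothetical callable that you would supply (for example, one that reads pod statuses from your own Kubernetes tooling), and the 600-second default timeout is an assumption.

```python
import time


def wait_until_running(get_statuses, timeout_s=600.0, interval_s=10.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until every pod reports Running.

    get_statuses is a hypothetical callable returning {pod_name: status}.
    Returns True once all pods are Running, False if the timeout elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        statuses = get_statuses()
        if statuses and all(s == "Running" for s in statuses.values()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Illustrative snapshots: the pod is Pending on the first poll, Running on the second.
snapshots = iter([
    {"clickhouse-0": "Pending"},
    {"clickhouse-0": "Running"},
])
print(wait_until_running(lambda: next(snapshots), sleep=lambda _: None))
```

Injecting `sleep` and `clock` keeps the loop easy to exercise without waiting in real time.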
Logs and traces share the same pipeline, so the first checks are the same:
Open Infrastructure > Kubernetes, filter the namespace to
chalk-telemetry, and confirm clickhouse-0, the chalk-collector-* pods, and the
chalk-telemetry-aggregator-* pod are all Running. If any are Pending, OOMKilled, or in
CrashLoopBackOff, see the next question.
Note the 7-day trace TTL: traces older than roughly a week are expected to age out, and the window can be shorter if ClickHouse storage has filled up. A missing old trace is usually expected behavior, not a failure.
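The pod check described above amounts to a small triage rule. The sketch below applies it to a hand-written snapshot; the pod name suffixes and the record shape are hypothetical, and only the status names come from this page.

```python
# Statuses this page calls out as needing attention.
PROBLEM_STATUSES = {"Pending", "OOMKilled", "CrashLoopBackOff"}


def triage(pods):
    """Map each pod name to 'ok', 'investigate', or 'unknown'."""
    report = {}
    for pod in pods:
        status = pod["status"]
        if status == "Running":
            report[pod["name"]] = "ok"
        elif status in PROBLEM_STATUSES:
            report[pod["name"]] = "investigate"
        else:
            # Any status not named on this page is left for manual review.
            report[pod["name"]] = "unknown"
    return report


# Hypothetical snapshot of the chalk-telemetry namespace.
pods = [
    {"name": "clickhouse-0", "status": "Pending"},
    {"name": "chalk-collector-abc12", "status": "Running"},
    {"name": "chalk-telemetry-aggregator-xyz89", "status": "Running"},
]

for name, verdict in triage(pods).items():
    print(f"{name}: {verdict}")
```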
You may see:
“Clickhouse server is not currently reachable. It is likely that the instance is under maintenance or the pod has been moved to a new node. Please try again in a few minutes…”
This is the most frequent telemetry error. It appears when the logs / traces views try to query ClickHouse and can’t reach it.
It only affects your ability to view telemetry in the dashboard. Online and offline query serving and feature computation are completely unaffected, and your application keeps working normally.
Causes, in order of likelihood:
Check the clickhouse-0 pod in the chalk-telemetry
namespace (Infrastructure > Kubernetes, namespace filter chalk-telemetry). If it’s Pending with
FailedScheduling events, its resource request is too large for the node it targets. The classic mistake
is setting a memory request equal to the node’s total memory (for example, a 32Gi request on a 32Gi
m6a.2xlarge). Daemonsets and system overhead consume part of every node, so a request that equals node
capacity can never be placed and the pod stays Pending forever. Pinning a fixed instance type and a high
request makes this especially easy to trip.
An OOMKilled status means memory needs to go up, but only to a value
that still fits the node.
Self-serve fix: go to Settings > Shared Resources > Telemetry and adjust the memory in the ClickHouse
and/or Aggregator sections.
Then Save & Apply, confirm clickhouse-0 reaches Running, and reload the dashboard. If it still can’t
reach ClickHouse after about 10 minutes, contact Chalk support.
Leave Serve over HTTP disabled in the ClickHouse settings unless the Chalk team explicitly asks you to enable it, since it can break ClickHouse connectivity.
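The scheduling arithmetic behind the Pending case can be sketched as follows. This is an illustration under stated assumptions: the `reserved` figure (memory taken by daemonsets and system overhead) is hypothetical and varies by cluster, and the parser handles only binary suffixes such as Gi.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a binary-suffixed quantity such as '32Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: treat as plain bytes


def fits_on_node(request: str, node_capacity: str, reserved: str) -> bool:
    """True if the request fits in what is left after node overhead."""
    allocatable = parse_quantity(node_capacity) - parse_quantity(reserved)
    return parse_quantity(request) <= allocatable


# A 32Gi request on a 32Gi node can never be placed once overhead is subtracted.
print(fits_on_node("32Gi", "32Gi", reserved="2Gi"))  # False
# A smaller request leaves room for the overhead (2Gi is an assumed figure).
print(fits_on_node("24Gi", "32Gi", reserved="2Gi"))  # True
```

The same check applies when raising memory after an OOMKilled status: the new value still has to pass it.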
--trace but no trace appears
Work through these in order:
--trace, with no space. Writing chalk query … -- trace (a space after
--) is parsed as “end of flags” plus a stray positional argument, so tracing is silently not enabled and
you get no error.
OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG
(Settings > Integrations > Environment Variables). The defaults are parentbased_traceidratio at 0.01
(1%). If the sampler is set to always_off, query-time flags are ignored and nothing is traced.
With a parentbased_* sampler you can still force a single query with --trace (CLI) or trace=True
(ChalkClient). See Tracing for the full table of samplers.
chalkpy[tracing] extra and chalkpy >= 2.95.9.
Once tracing works, traces are the primary tool for locating latency. Enable tracing on a representative query, open the Trace tab, and look for the longest spans. The waterfall shows where time is spent and what called what. For example, it distinguishes fast Redis reads from slow warehouse or Postgres round trips.
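As a rough illustration of "look for the longest spans", the sketch below ranks spans by duration. The `Span` shape, span names, and timings are hypothetical and do not reflect Chalk's trace format; the Trace tab shows the same information visually.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_spans(spans, top_n=3):
    """Return the top_n spans ordered from longest to shortest."""
    return sorted(spans, key=lambda s: s.duration_ms, reverse=True)[:top_n]


# Made-up timings: a fast cache read next to a slow database round trip.
spans = [
    Span("redis_read", 0.0, 2.0),
    Span("postgres_round_trip", 2.0, 140.0),
    Span("resolver_compute", 140.0, 155.0),
]

for span in slowest_spans(spans, top_n=2):
    print(f"{span.name}: {span.duration_ms:.1f} ms")
```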
From there:
ChalkGRPCClient), which is generally faster than HTTP.
For the broader set of tools for debugging queries (Query Plan Visualizer, Resolver Replay, logging), see Debugging Queries.
Reach out to Chalk support if: