# Telemetry FAQ
source: https://docs.chalk.ai/docs/telemetry-faq

## Troubleshoot missing logs, missing traces, and other telemetry pipeline issues

Telemetry is the observability pipeline that powers the logs and traces you see in the Chalk
dashboard: the structured-logs view, the query-logs view, and the Trace tab on the Online Query page.
It runs entirely inside your Kubernetes cluster, and almost all of it is configurable under
Settings > Shared Resources > Telemetry. This page answers the most common questions about why telemetry
might not behave as expected and explains how to resolve each problem yourself.

### How telemetry works

Logs and traces flow through four stages, all running in your cluster in the chalk-telemetry namespace:

Collector (OpenTelemetry or Vector) → Aggregator → ClickHouse → Dashboard

- The collector receives logs and traces emitted by the engine and resolvers. These pods are named
chalk-collector-*.
- The aggregator batches and forwards them. This pod is named chalk-telemetry-aggregator-*.
- ClickHouse is the database that stores the logs and traces. This is the clickhouse-0 pod. Traces have
a default TTL of 7 days, which Chalk shortens automatically as storage fills.
- The dashboard reads from ClickHouse to render the logs views and the Trace tab.

Because the whole pipeline is just a set of pods, most telemetry problems are infrastructure problems (a pod
that can't schedule, or one that's out of memory) rather than bugs in your features or resolvers. The two
places you'll work from are:

- Settings > Shared Resources > Telemetry: view and edit the deployment (CPU/memory for ClickHouse, the
collector, and the aggregator; ClickHouse storage size; collector runtime; and more). The guided form and a
raw JSON editor are both available here.
- Infrastructure > Kubernetes: inspect pod health. On the All Pods tab, set the namespace filter to
chalk-telemetry to see only the telemetry pods and their status, restarts, and age. Click a pod to view its
logs.

### Why are my logs or traces not showing up in the dashboard?

Logs and traces share the same pipeline, so the first checks are the same:

- Did the telemetry settings change recently? A recent edit to the ClickHouse or aggregator resources on the Settings > Shared Resources > Telemetry page is the most common trigger.
- Are the telemetry pods healthy? Go to Infrastructure > Kubernetes, filter the namespace to
chalk-telemetry, and confirm clickhouse-0, the chalk-collector-* pods, and the
chalk-telemetry-aggregator-* pod are all Running. If any are Pending, OOMKilled, or in
CrashLoopBackOff, see the next question.
- For traces specifically, also confirm tracing is actually enabled and sampled. See "I added --trace but no
trace appears" below.
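
If you prefer to check pod health from a terminal rather than the dashboard, the pod check above can be sketched in Python. This is a minimal sketch, assuming you have kubectl access to the cluster. The helper names unhealthy_pods and fetch_telemetry_pods are illustrative and not part of Chalk; the JSON fields read here are the standard Kubernetes pod status fields.

```python
import json
import subprocess

# Container-level reasons this FAQ treats as unhealthy.
BAD_REASONS = {"OOMKilled", "CrashLoopBackOff"}


def unhealthy_pods(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod name, problem) pairs from a `kubectl get pods -o json` document."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase != "Running":
            # Pending, Failed, Unknown, etc. are reported as-is.
            problems.append((name, phase))
            continue
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting", {}).get("reason")
            last = cs.get("lastState", {}).get("terminated", {}).get("reason")
            reason = next((r for r in (waiting, last) if r in BAD_REASONS), None)
            if reason:
                problems.append((name, reason))
                break
    return problems


def fetch_telemetry_pods(namespace: str = "chalk-telemetry") -> dict:
    """Fetch the pod list with kubectl (assumes kubectl is installed and configured)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling unhealthy_pods(fetch_telemetry_pods()) against a live cluster returns an empty list when every telemetry pod is Running with no OOMKilled or CrashLoopBackOff containers.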

Note the 7-day trace TTL: traces older than roughly a week are expected to age out, and the window can be
shorter if ClickHouse storage filled up. A missing old trace is usually expected behavior, not a failure.

### What does "ClickHouse server is not currently reachable" mean?

You may see an error in the dashboard stating that the ClickHouse server is not currently reachable. This is
the most frequent telemetry error. It appears when the logs and traces views try to query ClickHouse and
can't reach it.

It only affects your ability to view telemetry in the dashboard. Online and offline query serving and
feature computation are completely unaffected, and your application keeps working normally.

Causes, in order of likelihood:

- Transient unavailability. ClickHouse was briefly down for maintenance, or its pod was just rescheduled
onto a new node. Wait a few minutes and refresh. If it clears, no action is needed.
- The ClickHouse (or aggregator) pod can't schedule. Check the clickhouse-0 pod in the chalk-telemetry
namespace (Infrastructure > Kubernetes, namespace filter chalk-telemetry). If it's Pending with
FailedScheduling events, its resource request is too large for the node it targets. The classic mistake
is setting a memory request equal to the node's total memory (for example, a 32Gi request on a 32Gi
m6a.2xlarge). Daemonsets and system overhead consume part of every node, so a request that equals node
capacity can never be placed and the pod stays Pending forever. Pinning a fixed instance type while setting a
high request makes this mistake especially easy to make.
- OOM / CrashLoopBackOff. The aggregator can run out of memory under heavy query volume, and ClickHouse can
restart under memory pressure. Repeated OOMKilled status means memory needs to go up, but only to a value
that still fits the node.

Self-serve fix: go to Settings > Shared Resources > Telemetry and adjust the memory in the ClickHouse
and/or Aggregator sections.

- If a pod can't schedule: lower the request so it fits the node (leave headroom below the instance's total
memory, for example by requesting 30Gi rather than 32Gi on a 32Gi machine), or move to a larger instance type.
Never set the request equal to the machine size.
- If a pod is OOMing: raise the memory request for the resource that ran out of memory (the aggregator is
often the right knob), staying under node capacity.

Then click Save & Apply, confirm clickhouse-0 reaches Running, and reload the dashboard. If the dashboard still
can't reach ClickHouse after about 10 minutes, contact Chalk support.
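
The headroom rule in this section can be sketched as a quick arithmetic check before you save a new memory request. This is an illustrative sketch, not a Chalk tool: the function names are hypothetical, and the 2Gi default reserve is an assumption chosen to match the 30Gi-on-32Gi example above. The real overhead from daemonsets and system components varies by cluster.

```python
# Binary suffixes used by Kubernetes memory quantities (a subset).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '32Gi' to bytes. Only binary suffixes are handled."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already plain bytes


def fits_node(request: str, node_memory: str, reserved: str = "2Gi") -> bool:
    """True if the request leaves at least `reserved` free on the node.

    `reserved` is an assumed allowance for daemonsets and system overhead.
    """
    return parse_memory(request) + parse_memory(reserved) <= parse_memory(node_memory)
```

With these assumptions, fits_node("32Gi", "32Gi") is False (the request equals node capacity), while fits_node("30Gi", "32Gi") is True.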

### I added --trace but no trace appears

Work through these in order:

- Check the flag syntax. It is --trace, with no space. Writing chalk query … -- trace (a space after
--) is parsed as "end of flags" plus a stray positional argument, so tracing is silently not enabled and
you get no error.
- Check sampling. Tracing across your environment is governed by OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG
(Settings > Integrations > Environment Variables). The defaults are parentbased_traceidratio at 0.01
(1%). If the sampler is set to always_off, query-time flags are ignored and nothing is traced.
With a parentbased_* sampler you can still force a single query with --trace (CLI) or trace=True
(ChalkClient). See Tracing for the full table of samplers.
- Check the client. Tracing requires the chalkpy[tracing] extra and chalkpy >= 2.95.9.
- View it in the right place. A traced query's waterfall appears in the Trace tab of the Online Query page.
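
The sampling rules above can be summarized as a small decision function. This is a sketch of the logic as this page describes it, not Chalk's implementation: the function name and the return labels are hypothetical, and the defaults are the ones quoted above. Samplers this page does not cover fall through to a reminder to check the Tracing page.

```python
import os


def forced_trace_outcome(env=None) -> str:
    """Describe what happens to --trace / trace=True under the configured sampler.

    Returns "ignored", "honored", or "check-docs" (labels are illustrative).
    """
    env = os.environ if env is None else env
    # Defaults as stated in this FAQ: parentbased_traceidratio at 0.01 (1%).
    sampler = env.get("OTEL_TRACES_SAMPLER", "parentbased_traceidratio")
    if sampler == "always_off":
        # Query-time flags are ignored and nothing is traced.
        return "ignored"
    if sampler.startswith("parentbased_"):
        # A single query can still be forced with --trace or trace=True.
        return "honored"
    # Not covered on this page; see the Tracing page's sampler table.
    return "check-docs"
```

Passing a plain dictionary instead of the process environment, for example forced_trace_outcome({"OTEL_TRACES_SAMPLER": "always_off"}), makes it easy to reason about a setting before you change it.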

### How do I use traces to find a slow query?

Once tracing works, traces are the primary tool for locating latency. Enable tracing on a representative query,
open the Trace tab, and look for the longest spans. The waterfall shows where time is spent and what called
what. For example, it distinguishes fast Redis reads from slow warehouse or Postgres round trips.
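
Reading a waterfall for the longest spans amounts to sorting spans by duration. The sketch below illustrates that idea on hand-entered data. The Span structure and span names are hypothetical, and it does not read from Chalk or ClickHouse.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A simplified span: a name plus start and end offsets in milliseconds."""
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_spans(spans, top=3):
    """Return up to `top` spans, longest duration first."""
    return sorted(spans, key=lambda s: s.duration_ms, reverse=True)[:top]


# Hypothetical spans copied by hand from a trace waterfall.
example = [
    Span("redis_read", 0, 2),
    Span("postgres_lookup", 2, 180),
    Span("warehouse_scan", 5, 900),
]
for span in slowest_spans(example, top=2):
    print(f"{span.name}: {span.duration_ms:.0f} ms")
```

In this made-up example the warehouse and Postgres spans dominate, which is the pattern that points at a slow data source rather than at the cache.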

From there:

- Target the offending data source (add indices, increase its CPU/memory, or add read replicas).
- Accelerate any unaccelerated resolvers that dominate the trace. See Static Resolver Optimization.
- Consider querying through the gRPC client (ChalkGRPCClient), which is generally faster than HTTP.

For the broader set of tools for debugging queries (Query Plan Visualizer, Resolver Replay, logging), see
Debugging Queries.

### When should I contact Chalk support?

Reach out to Chalk support if:

- The "not reachable" error persists beyond about 10 minutes after confirming the telemetry pods are healthy.
- Pods keep OOMing or restarting even after correctly sizing memory within node limits.
- ClickHouse won't schedule and it's unclear which combination of request and instance type will fit.
- You're considering changing Serve over HTTP or other advanced fields. Confirm with the Chalk team first.

### Additional resources

- Tracing: how to enable tracing, sampling configuration, and viewing the Trace tab.
- Observability Overview: the full observability section, including metrics and alerting.
- Log Export: export logs to external monitoring systems and the log field schema.
- Debugging Queries: Query Plan Visualizer, Resolver Replay, and logging.
- Kubernetes Resources Overview: how Chalk uses Kubernetes resources.