Analyzing Prometheus metrics with LLMs: the token problem
"Just paste the metrics into the model and ask what's wrong" is a tempting workflow. It also gets expensive fast, hits context limits, and produces confidently wrong analysis once the input gets messy. Prometheus exposition format is human-readable, which makes it look LLM-ready. It is not.
Why raw Prometheus text is a bad LLM input
A single /metrics endpoint on a busy service can emit tens of thousands of lines. Every label permutation produces a new series, and labels are usually verbose: method, route, status_code, pod, namespace, cluster, sometimes a request ID that should never have been a label in the first place. The text format repeats the metric name, every label key, and every label value on every line.
Run a real example through a tokenizer and the numbers are sobering:
- A modest Go service exposing 3,000 series produces roughly 40,000-60,000 tokens per scrape.
- A node exporter on a single host: 20,000+ tokens.
- A kube-state-metrics endpoint in a 200-pod cluster: easily 200,000+ tokens.
Multiply by the number of targets you actually want to reason about and you blow past a 200K context window before you've added the prompt, the question, or any retrieval grounding.
Where the tokens go
Three sources dominate token cost:
Label cardinality. A histogram with le buckets, multiplied by route and status code and pod, generates dozens of lines per "logical" measurement. The model doesn't need every bucket to reason about latency. It almost never needs every pod.
HELP and TYPE comments. Useful for humans, low information density for an LLM that already knows what a counter is. They typically eat 10-20% of the payload.
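Repeated metric names. A line such as http_requests_total{method="GET",route="/api/v1/users",status="200",pod="api-7f8d-x9k2"} spells out the full metric name and every label key, and so does every other line in the same family. A tokenizer does not deduplicate that repetition, so a flat exposition format pays for it in full on every line.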
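To see the split on your own payload, a rough breakdown like the following can help. It is a minimal sketch, not a full exposition parser: it counts characters as a stand-in for tokens (an assumption; real tokenizers will differ), and the function name and sample text are made up for illustration.

```python
# Hypothetical sample payload, shaped like the lines discussed above.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/v1/users",status="200",pod="api-7f8d-x9k2"} 1027
http_requests_total{method="GET",route="/api/v1/users",status="500",pod="api-7f8d-x9k2"} 3
"""


def exposition_breakdown(text: str) -> dict:
    """Count characters spent on comments, metric names, labels, and values."""
    counts = {"comments": 0, "names": 0, "labels": 0, "values": 0}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # HELP and TYPE lines (and any other comments).
            counts["comments"] += len(line)
            continue
        brace = line.find("{")
        if brace != -1:
            # Sample with labels: name{labels} value
            close = line.rfind("}")
            counts["names"] += brace
            counts["labels"] += close - brace + 1
            counts["values"] += len(line) - close - 1
        else:
            # Sample without labels: name value
            name, _, rest = line.partition(" ")
            counts["names"] += len(name)
            counts["values"] += len(rest) + 1
    return counts


if __name__ == "__main__":
    breakdown = exposition_breakdown(SAMPLE)
    total = sum(breakdown.values())
    for part, chars in breakdown.items():
        print(f"{part:>8}: {chars:4d} chars ({100 * chars / total:.0f}%)")
```

On real scrapes the label share is usually the number to watch, since it grows with every label added to a family.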
What actually works
Treat the LLM as a reasoning layer over a summary, not a parser of raw exposition. The shape that holds up:
Pre-aggregate with PromQL. Don't ship raw series. Run the queries you'd actually ask (rate, histogram_quantile, topk, increase over a sensible window) and feed the LLM the aggregated result. A query such as topk(10, sum by (route, status) (rate(http_requests_total[5m]))) can be two orders of magnitude smaller than the raw series and keeps most of the signal.
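One way to wire this up is through the Prometheus HTTP API's instant-query endpoint (/api/v1/query). The sketch below assumes a reachable Prometheus server and an http_requests_total metric with route and status labels; the helper names and the exact query are illustrative, and only the parsing step is exercised without a network.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Aggregate per (route, status) first, then keep the ten busiest pairs.
QUERY = "topk(10, sum by (route, status) (rate(http_requests_total[5m])))"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def flatten_vector(payload: dict) -> list[dict]:
    """Turn an instant-vector response into small {labels, value} rows."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    rows = []
    for series in data["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        rows.append({"labels": series["metric"], "value": float(value)})
    return rows


def run_query(base_url: str, promql: str, timeout: float = 10.0) -> list[dict]:
    """Execute the query against a live server (needs network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return flatten_vector(json.load(resp))
```

A call such as run_query("http://localhost:9090", QUERY) would then return at most ten rows, which is the payload worth handing to the model.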
Drop labels the model doesn't need. Pod names, instance IDs, replica indices: usually noise for analysis. Aggregate them away with sum without (pod, instance) before serialization. Keep semantically meaningful labels (route, status, error type), drop infrastructure labels unless the question is specifically about infrastructure.
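When the samples have already been fetched and can't be re-queried, the same idea can be approximated client-side. This is a sketch of a sum-without style reduction in Python; the function name and the default label set are assumptions, and summing is only meaningful for additive values such as rates or counts, not for ratios or quantiles.

```python
from collections import defaultdict

# Labels treated as infrastructure noise; adjust per question (assumed defaults).
INFRA_LABELS = frozenset({"pod", "instance"})


def sum_without(samples, drop=INFRA_LABELS):
    """Sum sample values after removing the given labels.

    samples: iterable of (labels_dict, value) pairs.
    Returns a list of (labels_dict, summed_value) pairs.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # The remaining labels, in a stable order, identify the merged series.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return [(dict(key), total) for key, total in totals.items()]


if __name__ == "__main__":
    raw = [
        ({"route": "/api", "status": "500", "pod": "api-a"}, 2.0),
        ({"route": "/api", "status": "500", "pod": "api-b"}, 3.0),
        ({"route": "/api", "status": "200", "pod": "api-a"}, 10.0),
    ]
    for labels, value in sum_without(raw):
        print(labels, value)
```

Doing the aggregation in PromQL is still preferable where possible, since the server then never sends the per-pod series at all.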
Re-encode in a denser format. CSV or a compact JSON like {"m":"http_req","l":{"r":"/api","s":"500"},"v":12.4} typically uses 30-50% fewer tokens than exposition format for the same content. Every byte of repeated boilerplate is a byte you're paying for at every prompt.
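A quick way to check the saving on your own data is to serialize the same rows both ways and compare sizes. The sketch below does that for CSV, using character count as a rough proxy for tokens (an assumption; measure with your model's tokenizer before relying on the numbers). The helper names and the row shape are hypothetical.

```python
import csv
import io


def to_exposition(rows):
    """Render (metric, labels, value) rows in exposition-style text."""
    lines = []
    for metric, labels, value in rows:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{metric}{{{label_text}}} {value}")
    return "\n".join(lines)


def to_csv(rows):
    """Render the same rows as CSV: label keys appear once, in the header."""
    keys = sorted({k for _, labels, _ in rows for k in labels})
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["metric", *keys, "value"])
    for metric, labels, value in rows:
        writer.writerow([metric, *(labels.get(k, "") for k in keys), value])
    return buf.getvalue()


if __name__ == "__main__":
    rows = [
        ("http_requests_total", {"route": "/api/v1/users", "status": "200"}, 12.4),
        ("http_requests_total", {"route": "/api/v1/users", "status": "500"}, 0.3),
        ("http_requests_total", {"route": "/checkout", "status": "200"}, 4.1),
    ]
    print(len(to_exposition(rows)), "chars as exposition")
    print(len(to_csv(rows)), "chars as CSV")
```

Note that a JSON encoding with full metric and label names can come out longer than the exposition line it replaces; the compact JSON shape only pays off when names are abbreviated.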
Pair with metadata, not just numbers. The model reasons better when it knows what a metric means. Send a small schema dictionary once (metric name → unit, type, brief description) and reference it from compact rows, instead of inlining HELP comments next to every value.
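As an illustration of the send-once schema idea, the sketch below assigns each metric a short ID, emits a schema dictionary keyed by that ID, and encodes rows that reference it. The ID scheme (m0, m1, ...), the field names, and the example metadata are all assumptions made for the example.

```python
import json


def build_schema(metrics: dict) -> tuple[dict, dict]:
    """Assign short IDs to metrics.

    metrics: {metric_name: {"unit": ..., "type": ..., "help": ...}}
    Returns (schema keyed by short ID, mapping of metric name to short ID).
    """
    ids = {name: f"m{i}" for i, name in enumerate(sorted(metrics))}
    schema = {ids[name]: {"name": name, **meta} for name, meta in metrics.items()}
    return schema, ids


def encode_row(ids: dict, metric: str, labels: dict, value: float) -> str:
    """Encode one sample as compact JSON that points at the schema by ID."""
    return json.dumps({"m": ids[metric], "l": labels, "v": value}, separators=(",", ":"))


if __name__ == "__main__":
    schema, ids = build_schema({
        "http_requests_total": {
            "unit": "requests", "type": "counter", "help": "Total HTTP requests."},
        "http_request_duration_seconds": {
            "unit": "seconds", "type": "histogram", "help": "Request latency."},
    })
    # Sent once, at the start of the conversation.
    print(json.dumps(schema, separators=(",", ":")))
    # Sent per sample, many times.
    print(encode_row(ids, "http_requests_total", {"route": "/api", "status": "500"}, 12.4))
```

Sorting the metric names before numbering keeps the IDs stable between calls as long as the metric set does not change, which also keeps the schema block cache-friendly.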
The challenges that don't go away
Even with good preprocessing, a few problems persist.
Time series are sequences; a scrape is a single point. An LLM looking at a flat snapshot sees one moment. To reason about a regression, it needs at least two: a baseline and a current window. That doubles the input and doesn't always double the insight. For trend questions, pre-compute the delta and feed that, not both windows.
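Pre-computing the delta can be as simple as the sketch below, which compares two already-aggregated windows and keeps only the series that moved. The function name, the 10% threshold, and the example values are assumptions; series that are missing from either window, or that have a zero baseline, are skipped here and would need separate handling.

```python
def window_deltas(baseline: dict, current: dict, min_rel_change: float = 0.1) -> list[dict]:
    """Compare two windows and return the series that changed noticeably.

    baseline, current: {series_key: value}, e.g. p99 latency per route.
    Rows are sorted by the size of the relative change, largest first.
    """
    rows = []
    for key in sorted(baseline.keys() & current.keys()):
        before, after = baseline[key], current[key]
        if before == 0:
            continue  # relative change is undefined for a zero baseline
        rel = (after - before) / before
        if abs(rel) >= min_rel_change:
            rows.append({
                "series": key,
                "before": before,
                "after": after,
                "rel_change": round(rel, 3),
            })
    return sorted(rows, key=lambda r: abs(r["rel_change"]), reverse=True)


if __name__ == "__main__":
    # Hypothetical p99 latency in milliseconds, yesterday vs. the last hour.
    yesterday = {"/checkout": 200.0, "/cart": 100.0, "/home": 50.0}
    last_hour = {"/checkout": 500.0, "/cart": 105.0, "/home": 50.0}
    for row in window_deltas(yesterday, last_hour):
        print(row)
```

The model then receives a handful of changed series instead of two full windows, and the unchanged ones cost nothing.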
Hallucinated correlations. Give a model 30 metrics and ask "what caused the latency spike," and it will find a story. It will sound plausible. It is often wrong. Constrain the model with retrieval: ground the answer in a small set of metrics flagged by an anomaly detector or alert, rather than asking it to scan the whole surface.
Cardinality of natural language. Asking "is anything unusual?" is a far harder prompt than "compare p99 latency on /checkout over the last hour vs the same window yesterday." Specific questions produce specific answers. Treat the LLM like a junior engineer who needs the question framed.
Cost compounds. A diagnostic chat that reads metrics, calls a tool, reads more metrics, and iterates can rack up 500K+ tokens per session. Cache aggressively (the prompt cache is your friend if the schema and base context are stable), and short-circuit with deterministic checks before paying the LLM tax.
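A deterministic gate in front of the model might look like the sketch below: a few threshold checks run first, and the LLM is only invoked when at least one of them fails. The check names, metric keys, and thresholds are hypothetical, and a metric missing from the input is treated as passing, which may not be what you want in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str         # human-readable label for the check
    metric: str       # key into the pre-aggregated values
    threshold: float  # the check fails when the value exceeds this


def failing_checks(values: dict, checks: list) -> list:
    """Return the names of checks whose metric exceeds its threshold."""
    return [c.name for c in checks if values.get(c.metric, 0.0) > c.threshold]


def should_call_llm(values: dict, checks: list) -> bool:
    """Only pay for an LLM call when a cheap check has already flagged something."""
    return bool(failing_checks(values, checks))


if __name__ == "__main__":
    checks = [
        Check("error ratio above 1%", "error_ratio", 0.01),
        Check("p99 latency above 1s", "p99_seconds", 1.0),
    ]
    healthy = {"error_ratio": 0.002, "p99_seconds": 0.4}
    degraded = {"error_ratio": 0.05, "p99_seconds": 0.4}
    print(should_call_llm(healthy, checks))
    print(failing_checks(degraded, checks))
```

The list of failing checks also makes a good starting prompt, since it tells the model which metrics to look at first.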
A practical pattern
What we've seen work in production:
- An MCP-style tool that exposes query_promql rather than get_metrics. The LLM writes the PromQL; the tool returns aggregated results.
- A schema endpoint, cached, that lists available metrics and their meaning. Sent once per conversation, then referenced.
- Result serialization in a compact JSON shape, with a hard token budget per tool call (e.g. 4K). Bigger results get truncated with a "top-N by impact" rule.
- A guardrail prompt instructing the model to ground every claim in a returned metric value, and to say "I don't know" when the data doesn't support a conclusion.
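The per-call token budget with a "top-N by impact" rule could be sketched as follows. Everything here is an assumption made for illustration: the four-characters-per-token estimate stands in for a real tokenizer, "impact" is taken to be the absolute value of the sample, and the row shape follows a compact {"m", "l", "v"} layout.

```python
import json


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token)."""
    return max(1, len(text) // 4)


def truncate_to_budget(rows, budget_tokens=4000, impact=lambda row: abs(row["v"])):
    """Keep the highest-impact rows that fit within the token budget.

    Returns the kept rows plus a count of omitted rows, so the model
    can tell that the result was truncated.
    """
    ranked = sorted(rows, key=impact, reverse=True)
    kept, used = [], 0
    for row in ranked:
        cost = estimate_tokens(json.dumps(row, separators=(",", ":")))
        if used + cost > budget_tokens:
            break
        kept.append(row)
        used += cost
    return {"rows": kept, "omitted": len(rows) - len(kept)}


if __name__ == "__main__":
    rows = [{"m": "m1", "l": {"route": f"/r{i}"}, "v": float(i)} for i in range(100)]
    result = truncate_to_budget(rows, budget_tokens=50)
    print(len(result["rows"]), "rows kept,", result["omitted"], "omitted")
```

Returning the omitted count alongside the rows supports the guardrail prompt: the model can say the data was truncated rather than treat the top-N as the whole picture.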
The accuracy gain over "paste the whole scrape" is large. The token cost drops by 10-50x. Both matter.
Bottom line
Prometheus exposition format is for humans and scrapers. Don't hand it to an LLM raw. Aggregate, drop labels you don't need, re-encode densely, and let the model query a tool rather than parse a wall of text. The model is the reasoning engine; PromQL is still the right query language. Use both for what they're good at.
If you're connecting metrics to AI workflows, we'd like to hear about it.