LLM observability: an open-source toolkit, Loewen Tech

LLM-backed features look like normal application code, but they fail in different shapes. A 4xx isn't always a client bug (it can be a rate limit or a context-length overflow); a slow response isn't always slow code; a regression can come from a model deprecation rather than a deploy. The observability you set up for HTTP services covers maybe 30% of what you need to debug an LLM feature in production.

Open-source tooling has caught up enough that you can build a real LLM observability stack without paying a vendor. Here's what to wire together, and why.

What "LLM observability" actually means

Four signals matter, in order of urgency:

Tokens in/out and cost. Every call has a price. Without per-feature attribution you can't tell which user flow exploded the bill.
Latency: wall clock, time-to-first-token (TTFT), time-to-last-token. End users feel TTFT; total duration drives SLOs.
Quality: pass/fail of an eval, whether the response is grounded in retrieved context, whether it actually answers the question.
Tool/agent traces. When the model calls tools, the chain of model → tool → model needs the same span semantics as a normal trace.

Standard APM gets you HTTP-level latency and errors. Everything above needs LLM-aware tooling.
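To make the latency signal concrete, here is a minimal sketch of measuring TTFT and total duration around a streamed response. It assumes the client exposes the response as an iterable of text chunks; `timed_completion`, `LatencyRecord`, and the `open_stream` callable are hypothetical names for illustration, not part of any SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class LatencyRecord:
    text: str
    ttft_s: Optional[float]  # None if the stream produced no chunks
    total_s: float


def timed_completion(
    open_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.monotonic,
) -> LatencyRecord:
    """Run a streaming call and record time-to-first-token and total time.

    `open_stream` is a hypothetical callable that sends the request and
    returns an iterable of text chunks. The clock starts before it is
    called, so request setup counts toward TTFT.
    """
    start = clock()
    first_chunk_at = None
    parts = []
    for chunk in open_stream():
        if first_chunk_at is None:
            first_chunk_at = clock()  # first chunk arrived
        parts.append(chunk)
    end = clock()  # stream exhausted
    ttft = None if first_chunk_at is None else first_chunk_at - start
    return LatencyRecord(text="".join(parts), ttft_s=ttft, total_s=end - start)
```

An HTTP-level timer only sees the equivalent of `total_s`; the two values would normally be attached to the span for the model call so they can be charted separately.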

The open-source landscape

The space moves fast, but a few projects have stabilized:

OpenLLMetry (Traceloop). An OpenTelemetry-native instrumentation layer for LLM SDKs (OpenAI, Anthropic, Bedrock, and others). Auto-captures spans for model calls, tool calls, and vector DB queries. Emits OTLP, so any OTel-compatible backend works.

Langfuse. Self-hostable backend built specifically for LLM telemetry: traces, prompts, evals, datasets. Strong UI, generous data model. Pairs well with OpenLLMetry as the SDK side and Postgres + ClickHouse as the storage tier.

Arize Phoenix. Open-source LLM tracing and evals tool that runs locally or as a server. Particularly strong on retrieval debugging: embedding drift visualizations and retrieval quality views.

OpenLIT. Newer, OTel-based, focused on cost and performance dashboards out of the box.

Helicone. Proxy-based: it sits in front of OpenAI/Anthropic and logs every request. Lower-touch instrumentation; the trade-off is one more network hop and a single point of failure.

These overlap. You don't need all of them.

A workable stack

For most teams shipping LLM features today:

Instrumentation: OpenLLMetry. OTel-native, no proxy, language coverage is good.
Storage and UI: Langfuse. Self-host on Postgres + ClickHouse, or use their cloud.
Evals: start with Langfuse's eval runners; graduate to a dedicated eval framework (Promptfoo, DeepEval) when you need offline regression suites.

This gets you per-call traces with token counts, latency breakdowns, prompt/response capture, and a place to wire eval scores back to the same trace.
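Token counts on each trace are what make per-feature cost attribution possible. The sketch below assumes every call record carries a feature name, a model identifier, and input/output token counts. The model names and prices are made up for illustration, and `CallRecord`, `call_cost`, and `cost_by_feature` are hypothetical helpers, not part of Langfuse or OpenLLMetry.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical prices in USD per 1,000 tokens. In practice this table would be
# loaded from config so a price change does not need a code deploy.
PRICES_PER_1K = {
    "example-large": {"input": 0.01, "output": 0.03},
    "example-small": {"input": 0.0005, "output": 0.0015},
}


@dataclass(frozen=True)
class CallRecord:
    feature: str        # which user flow made the call
    model: str          # model identifier reported on the trace
    input_tokens: int
    output_tokens: int


def call_cost(call: CallRecord, prices: Dict[str, Dict[str, float]] = PRICES_PER_1K) -> float:
    """Cost of one call; input and output tokens are priced separately."""
    price = prices[call.model]  # raises KeyError for an unpriced model
    return (call.input_tokens / 1000) * price["input"] + (
        call.output_tokens / 1000
    ) * price["output"]


def cost_by_feature(
    calls: Iterable[CallRecord],
    prices: Dict[str, Dict[str, float]] = PRICES_PER_1K,
) -> Dict[str, float]:
    """Sum call costs per feature so the expensive flow stands out."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call.feature] += call_cost(call, prices)
    return dict(totals)
```

Failing loudly on an unknown model is a deliberate choice here: a silently unpriced model would show up as zero cost on the dashboard.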

Things that bite you in production

A few things worth knowing before you ship:

PII in prompts. Every observability tool wants to capture the prompt text. Decide your redaction policy before you turn on capture, not after. Most tools support span attribute filtering at the SDK or collector level.
Sampling is harder than for HTTP. A 1% sample of an LLM endpoint can hide a regression that only shows up on a specific input class. Prefer outcome-based sampling: keep 100% of failed evals, then a smaller fraction of healthy traffic.
Cost data needs care. Token counts come from the model response; cost = (tokens ÷ 1,000) × price-per-1k, computed separately for input and output tokens, which are usually priced differently. Track price per model in a single place; an upstream price change shouldn't require a code deploy to reflect in dashboards.
Model deprecations are silent. Wire alerts to the model identifier in your trace attributes. If calls are still going to a model that is scheduled for retirement (gpt-4-0613, say), you want to find out before the model starts returning 404s.
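The redaction point can be sketched as a small filter applied before prompt text is attached to a span. This is a deliberately crude example: the two regexes only catch email addresses and phone-like digit runs, and the attribute keys `llm.prompt` and `llm.completion` are hypothetical names, not a convention from any particular tool. A real policy would cover whatever PII classes your data actually contains.

```python
import re
from typing import Any, Dict, Tuple

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose match for phone-like runs of digits, spaces, and punctuation.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def redact_attributes(
    attributes: Dict[str, Any],
    text_keys: Tuple[str, ...] = ("llm.prompt", "llm.completion"),  # hypothetical keys
) -> Dict[str, Any]:
    """Return a copy of span attributes with free-text fields redacted."""
    return {
        key: redact(value) if key in text_keys and isinstance(value, str) else value
        for key, value in attributes.items()
    }
```

Running the filter before export, rather than in the dashboard, means unredacted text never reaches the storage tier.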
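Outcome-based sampling comes down to a keep/drop decision made after the outcome of a call is known. A minimal sketch, assuming each finished trace carries an error flag and an optional eval result; `keep_trace` and its parameters are hypothetical, and the 5% default is an arbitrary placeholder.

```python
import random
from typing import Callable, Optional


def keep_trace(
    eval_passed: Optional[bool],
    had_error: bool,
    healthy_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to retain a finished trace.

    Failures are always kept; healthy traffic is kept at `healthy_rate`.
    `eval_passed` is None when no eval ran for this trace.
    """
    if had_error or eval_passed is False:
        return True  # keep 100% of failed calls and failed evals
    return rng() < healthy_rate  # keep a fraction of healthy traffic
```

Because the decision needs the eval result, it has to run after the trace completes, which in an OpenTelemetry pipeline usually means a tail-sampling stage rather than a head sampler in the SDK.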

A small principle

Treat LLM telemetry as part of the same OpenTelemetry pipeline as the rest of your services. Same collector, same backend, correlated trace IDs. The temptation is to bolt on a separate "LLM dashboard", and you may want one for cost visibility, but the underlying spans should live in the same trace as the HTTP request that triggered them. Otherwise you're back to switching tabs to debug, which is the exact thing observability is meant to fix.
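What "same trace" means on the wire is easiest to see with the W3C `traceparent` header: the LLM call keeps the incoming trace ID and only mints a new span ID. The OpenTelemetry SDKs handle this propagation for you, so the sketch below is purely illustrative. `child_traceparent` is a hypothetical helper, and the validation is simplified compared with the full specification.

```python
import re
import secrets
from typing import Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def child_traceparent(incoming: Optional[str]) -> str:
    """Build the traceparent for an LLM call span.

    Reuses the caller's trace ID when the incoming header is well formed,
    so the model call lands in the same trace as the HTTP request.
    Otherwise starts a new sampled trace.
    """
    match = TRACEPARENT.match(incoming or "")
    if match and match.group(2) != "0" * 32:
        trace_id, flags = match.group(2), match.group(4)
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # new 16-hex-char ID for this span
    return f"00-{trace_id}-{span_id}-{flags}"
```

If the LLM spans were exported through a separate pipeline with their own trace IDs, this link would be lost and the cost dashboard could not be joined back to the request that caused the spend.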

Building LLM features that need to behave in production? Let's talk.