
Observability for AI Systems: What to Monitor and Why

Latency, token usage, output quality, error rates, cost per request.

The Prompt Engineering Project February 15, 2025 11 min read

Quick Answer

AI system monitoring provides visibility into LLM application performance, quality, and cost in production. Essential components include request-level trace logging, latency and error rate dashboards, token usage and cost tracking, output quality scoring, model drift detection, and user feedback loops. Unlike traditional observability, AI monitoring must track non-deterministic outputs, making quality evaluation and anomaly detection uniquely challenging.

Traditional software fails in binary ways. A request returns a 500 or it does not. A database query times out or it completes. The monitoring tools we have built over the past two decades -- Datadog, New Relic, PagerDuty -- are designed for this world. They watch for hard failures: error rates, latency spikes, resource exhaustion.

AI systems fail differently. A language model that returns a 200 OK with confidently wrong information has not failed by any metric your APM tool tracks. A model that produces valid JSON containing a hallucinated statistic will pass every health check. A system that costs three times more than expected because a prompt redesign doubled token consumption will show green across every dashboard until the invoice arrives.

AI observability requires monitoring five dimensions that traditional tools either ignore or handle poorly: latency distribution, token usage, output quality, error classification, and cost per request. Each dimension has its own instrumentation requirements, its own failure signatures, and its own alerting thresholds. This article covers all five, with implementation details for each.

Dimension 1: Latency Distribution

Latency in AI systems is not a single number. It varies by model, by prompt length, by requested output length, and by current provider load. Tracking p50 and p99 is a start, but it misses the patterns that matter for debugging and capacity planning.

The metrics you need:

- Latency distribution by model. GPT-4o, Claude Sonnet, and Gemini respond on fundamentally different timescales.
- Latency by prompt token count. Longer context windows mean longer time-to-first-token.
- Latency by output token count. A 50-token classification response and a 2,000-token analysis response have different latency profiles even from the same model.
- Time-to-first-token versus total generation time. Streaming applications care about the former; batch applications care about the latter.

The failure signature to watch for is latency bimodality. If your latency histogram shows two distinct peaks, it usually means your traffic is splitting between a fast path (cache hits, short responses) and a slow path (cache misses, long generation). This bimodality confuses traditional p50/p99 alerting because both peaks might be "normal" individually while the distribution as a whole indicates a problem.

Do not alert on average latency. A system that serves half its requests in 200ms and half in 8 seconds has an average of 4.1 seconds, which tells you nothing useful about either population.
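The warning above is easy to demonstrate. This sketch uses synthetic bimodal traffic and a simple nearest-rank percentile helper; both are illustrative, not taken from any monitoring library:

```typescript
// Nearest-rank percentile over a pre-sorted array (illustrative helper).
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx];
}

// Synthetic bimodal traffic: half fast cache hits (~200ms),
// half slow long generations (~8000ms).
const latencies = [
  ...Array.from({ length: 500 }, () => 200),
  ...Array.from({ length: 500 }, () => 8000),
].sort((a, b) => a - b);

const mean = latencies.reduce((sum, x) => sum + x, 0) / latencies.length;
const p50 = percentile(latencies, 0.5);
const p99 = percentile(latencies, 0.99);

// mean is 4100ms, a latency no request actually experienced;
// p50 and p99 both land in the slow population.
console.log({ mean, p50, p99 });
```

Alerting on the two populations separately, or on the histogram shape itself, preserves the signal that the mean destroys.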

Dimension 2: Token Usage

Token usage is the most directly actionable metric in AI observability because it maps linearly to cost. Every token you track is a dollar you can account for. The three measurements that matter are input tokens per request (how much context you are sending), output tokens per request (how much the model is generating), and the input-to-output ratio (which reveals whether your system is context-heavy or generation-heavy).

A context-heavy system -- high input tokens, low output tokens -- is typically doing classification, extraction, or summarization. The optimization lever is context compression: can you send less context without degrading output quality? A generation-heavy system -- low input tokens, high output tokens -- is typically doing content creation or code generation. The optimization lever is output constraints: can you specify a maximum length or a more concise format?

Track token usage over time to catch drift. A prompt that used 800 input tokens last month but uses 1,200 this month has probably accumulated additional context through well-meaning edits. Each edit was individually reasonable. The cumulative effect is a 50% cost increase that nobody noticed because nobody was watching.
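A drift check of this kind takes only a few lines. In this sketch the prompt name, the 20% threshold, and the helper are all illustrative assumptions, not part of any particular tool:

```typescript
// Sketch: flag token drift by comparing recent average input tokens
// against a stored baseline. Threshold and names are illustrative.
interface DriftCheck {
  promptName: string;
  baselineInputTokens: number;
  recentInputTokens: number;
  driftRatio: number;
  exceedsThreshold: boolean;
}

function checkTokenDrift(
  promptName: string,
  baselineInputTokens: number,
  recentSamples: number[],
  threshold = 0.2 // alert when recent usage exceeds baseline by 20%
): DriftCheck {
  const recentInputTokens =
    recentSamples.reduce((sum, x) => sum + x, 0) / recentSamples.length;
  const driftRatio = recentInputTokens / baselineInputTokens - 1;
  return {
    promptName,
    baselineInputTokens,
    recentInputTokens,
    driftRatio,
    exceedsThreshold: driftRatio > threshold,
  };
}

// The 800 -> 1,200 token example above: 50% drift, well past the threshold.
const drift = checkTokenDrift('summarize-ticket', 800, [1150, 1200, 1250]);
console.log(drift.driftRatio.toFixed(2), drift.exceedsThreshold); // 0.50 true
```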

token-tracking-middleware.ts
// AIRequest, AIResponse, and MetricsLogger are application-defined types.
interface TokenMetrics {
  requestId: string
  model: string
  promptName: string
  inputTokens: number
  outputTokens: number
  totalTokens: number
  costUsd: number
  latencyMs: number
  timestamp: Date
}

// Prices in USD per 1 million tokens
const MODEL_PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o':         { input: 2.50,  output: 10.00 },
  'gpt-4o-mini':    { input: 0.15,  output: 0.60  },
  'claude-sonnet':  { input: 3.00,  output: 15.00 },
  'claude-haiku':   { input: 0.25,  output: 1.25  },
}

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const pricing = MODEL_PRICING[model]
  if (!pricing) return 0
  return (
    (inputTokens / 1_000_000) * pricing.input +
    (outputTokens / 1_000_000) * pricing.output
  )
}

export function createObservabilityMiddleware(logger: MetricsLogger) {
  return async function observabilityMiddleware(
    request: AIRequest,
    next: (req: AIRequest) => Promise<AIResponse>
  ): Promise<AIResponse> {
    const start = performance.now()
    const response = await next(request)
    const latencyMs = performance.now() - start

    const metrics: TokenMetrics = {
      requestId: request.id,
      model: request.model,
      promptName: request.promptName ?? 'unknown',
      inputTokens: response.usage.prompt_tokens,
      outputTokens: response.usage.completion_tokens,
      totalTokens: response.usage.total_tokens,
      costUsd: calculateCost(
        request.model,
        response.usage.prompt_tokens,
        response.usage.completion_tokens
      ),
      latencyMs,
      timestamp: new Date(),
    }

    logger.record(metrics)

    // Alert on anomalies
    if (metrics.costUsd > 0.50) {
      logger.alert('high-cost-request', metrics)
    }
    if (metrics.latencyMs > 30_000) {
      logger.alert('high-latency-request', metrics)
    }

    return response
  }
}
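The middleware above expects a MetricsLogger with record and alert methods, but the article does not define that interface. An assumed minimal shape with an in-memory implementation, handy for local testing before wiring up Prometheus or Langfuse, might look like this:

```typescript
// Sketch: an assumed MetricsLogger shape plus an in-memory implementation.
// Real deployments would ship records to a time-series database; this
// version just buffers them so tests can inspect what was logged.
interface MetricsLogger {
  record(metrics: Record<string, unknown>): void;
  alert(name: string, metrics: Record<string, unknown>): void;
}

class InMemoryMetricsLogger implements MetricsLogger {
  readonly records: Record<string, unknown>[] = [];
  readonly alerts: { name: string; metrics: Record<string, unknown> }[] = [];

  record(metrics: Record<string, unknown>): void {
    this.records.push(metrics);
  }

  alert(name: string, metrics: Record<string, unknown>): void {
    this.alerts.push({ name, metrics });
  }
}

const logger = new InMemoryMetricsLogger();
logger.record({ requestId: 'req-1', costUsd: 0.62 });
logger.alert('high-cost-request', { requestId: 'req-1', costUsd: 0.62 });
console.log(logger.records.length, logger.alerts.length); // 1 1
```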

Dimension 3: Output Quality

Output quality is the dimension that separates AI observability from traditional monitoring. Traditional systems do not need to evaluate whether their output is "good" -- a database either returns the correct rows or it does not. Language models return outputs that exist on a spectrum from excellent to subtly wrong to completely fabricated, and your monitoring must account for that spectrum.

Three automated quality signals can be measured without human involvement:

- Format compliance checks whether the output matches the expected structure: valid JSON, correct schema, required fields present, enum values within allowed sets. This is the easiest to implement and catches the most obvious failures.
- Factual grounding measures whether claims in the output can be traced to the provided context. This requires comparing the output against the input context and flagging assertions that appear unsupported.
- Relevance scoring evaluates whether the output addresses the actual request. A model that produces a well-formatted, factually grounded response to the wrong question has failed in a way that format and grounding checks will miss.

Beyond automated signals, you need a human review sampling strategy. Not every output needs human review. But a statistically significant sample -- typically 2-5% of production traffic -- should be scored by human reviewers on a regular cadence. This serves two purposes: it catches quality issues that automated checks miss, and it calibrates your automated checks against human judgment over time.

quality-scoring.ts
import { z } from 'zod'

interface QualityScore {
  requestId: string
  formatCompliance: number    // 0-1: schema match
  groundedness: number        // 0-1: claims supported by context
  relevance: number           // 0-1: addresses the actual request
  composite: number           // weighted average
  needsHumanReview: boolean
}

async function scoreOutput(
  request: AIRequest,
  response: AIResponse,
  schema?: z.ZodSchema
): Promise<QualityScore> {
  // Format compliance: does it parse and match schema?
  let formatCompliance = 1.0
  if (schema) {
    const parsed = schema.safeParse(response.content)
    formatCompliance = parsed.success ? 1.0 : 0.0
  }

  // Groundedness: use a lightweight model to check claims
  const groundedness = await checkGroundedness(
    request.context,
    response.content
  )

  // Relevance: does the output address the request?
  const relevance = await checkRelevance(
    request.userQuery,
    response.content
  )

  const composite =
    formatCompliance * 0.3 +
    groundedness * 0.4 +
    relevance * 0.3

  return {
    requestId: request.id,
    formatCompliance,
    groundedness,
    relevance,
    composite,
    // Flag for human review if below threshold
    // or randomly sample 3% of passing requests
    needsHumanReview:
      composite < 0.85 || Math.random() < 0.03,
  }
}

A model that returns a 200 OK with confidently wrong information has not failed by any metric your APM tool tracks. Quality scoring closes that gap.

Dimension 4: Error Classification

AI systems produce four categories of errors, and each requires different handling:

- API errors are the most familiar: 429 rate limits, 500 server errors, 503 service unavailable. These are transient and recoverable with retry logic.
- Timeout errors occur when generation takes longer than your deadline. They correlate with prompt length and requested output complexity, and the fix is usually model selection or prompt optimization rather than longer timeouts.
- Rate limit errors are a capacity planning signal. If you are hitting rate limits regularly, you need either a higher tier, request queuing, or a secondary model for overflow.
- Malformed output errors are the most insidious because they are not HTTP errors. The request succeeds, the model returns a 200, but the content is unparseable or does not match the expected schema.

Track each category separately with its own alert threshold. An API error rate of 2% is normal. A malformed output rate of 2% is a prompt engineering emergency. Combining them into a single "error rate" metric hides the signal you need to act on.

Suggested alert thresholds by category:

- API Errors: < 1%
- Timeouts: < 0.5%
- Rate Limits: < 0.1%
- Malformed Output: < 0.5%
A combined "error rate" metric is worse than no metric at all. It hides the distinction between transient infrastructure issues and systemic prompt failures.
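A classifier that routes each failure into one of the four categories, with its own counter and threshold, can be sketched as follows. The error shape and code strings here are assumptions for illustration, not any specific SDK's:

```typescript
// Sketch: classify failures into four categories so each gets its own
// counter and alert threshold. Error shapes are illustrative.
type AIErrorCategory =
  | 'api_error'
  | 'timeout'
  | 'rate_limit'
  | 'malformed_output';

function classifyError(err: { status?: number; code?: string }): AIErrorCategory {
  if (err.status === 429) return 'rate_limit';
  if (err.code === 'ETIMEDOUT' || err.code === 'deadline_exceeded') return 'timeout';
  if (err.status !== undefined && err.status >= 500) return 'api_error';
  // No HTTP failure at all: the 200-with-unparseable-content case.
  return 'malformed_output';
}

// Per-category alert thresholds, matching the rates listed above.
const THRESHOLDS: Record<AIErrorCategory, number> = {
  api_error: 0.01,
  timeout: 0.005,
  rate_limit: 0.001,
  malformed_output: 0.005,
};

console.log(classifyError({ status: 429 }));       // rate_limit
console.log(classifyError({ status: 503 }));       // api_error
console.log(classifyError({ code: 'ETIMEDOUT' })); // timeout
console.log(classifyError({}));                    // malformed_output
```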

Dimension 5: Cost Per Request

Cost per request is the metric that connects AI observability to business viability. At small scale, the cost of individual requests is negligible. At production scale -- thousands or millions of requests per day -- the difference between a well-optimized and a poorly-optimized system is the difference between a sustainable product and one that loses money on every transaction.

The implementation requires real-time cost tracking at the request level. Every request should be tagged with its model, its token counts, and its computed cost. These should be aggregated by prompt name, by user segment, by feature, and by time period. Budget alerts should fire when daily, weekly, or monthly spend exceeds thresholds, and when the cost of a specific prompt or feature trends upward beyond expected growth.
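The aggregation and budget-alert steps can be sketched in a few lines. The prompt names and budget figures here are illustrative:

```typescript
// Sketch: aggregate per-request cost records by prompt name and flag
// any prompt whose daily spend exceeds its budget.
interface CostRecord {
  promptName: string;
  costUsd: number;
}

function dailySpendByPrompt(records: CostRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.promptName, (totals.get(r.promptName) ?? 0) + r.costUsd);
  }
  return totals;
}

function overBudget(
  totals: Map<string, number>,
  budgets: Record<string, number>
): string[] {
  return [...totals.entries()]
    .filter(([name, spend]) => spend > (budgets[name] ?? Infinity))
    .map(([name]) => name);
}

const records: CostRecord[] = [
  { promptName: 'summarize', costUsd: 0.8 },
  { promptName: 'summarize', costUsd: 0.9 },
  { promptName: 'classify', costUsd: 0.1 },
];
const totals = dailySpendByPrompt(records);
const alerts = overBudget(totals, { summarize: 1.5, classify: 5.0 });
console.log(alerts); // 'summarize' exceeds its 1.50 budget (total 1.70)
```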

Model-specific cost breakdowns reveal optimization opportunities. If 80% of your cost comes from 20% of your prompts, the optimization target is obvious. If a feature uses GPT-4o for a task that Claude Haiku could handle with equivalent quality, the cost reduction is roughly tenfold at the prices in the table above. These optimizations are invisible without per-request cost tracking.

grafana-ai-dashboard.yml
# Grafana Dashboard Configuration for AI Observability
apiVersion: 1
providers:
  - name: 'AI Observability'
    folder: 'AI Systems'
    type: file

# Panel 1: Cost Overview
panels:
  - title: 'Daily Cost by Model'
    type: timeseries
    datasource: prometheus
    targets:
      - expr: |
          sum by (model) (
            rate(ai_request_cost_usd_total[1h])
          ) * 3600 * 24
    fieldConfig:
      defaults:
        unit: currencyUSD

  - title: 'Cost per Request by Prompt'
    type: barchart
    datasource: prometheus
    targets:
      - expr: |
          avg by (prompt_name) (
            ai_request_cost_usd_total
          )

  - title: 'Token Usage Ratio'
    type: stat
    datasource: prometheus
    targets:
      - expr: |
          sum(ai_output_tokens_total)
          / sum(ai_input_tokens_total)

  - title: 'Quality Score Distribution'
    type: histogram
    datasource: prometheus
    targets:
      - expr: ai_quality_composite_score

  - title: 'Error Rate by Category'
    type: piechart
    datasource: prometheus
    targets:
      - expr: |
          sum by (error_type) (
            rate(ai_errors_total[1h])
          )

  - title: 'Latency by Model (p50 / p95 / p99)'
    type: timeseries
    datasource: prometheus
    targets:
      - expr: |
          histogram_quantile(0.50,
            rate(ai_request_duration_seconds_bucket[5m])
          )
      - expr: |
          histogram_quantile(0.95,
            rate(ai_request_duration_seconds_bucket[5m])
          )
      - expr: |
          histogram_quantile(0.99,
            rate(ai_request_duration_seconds_bucket[5m])
          )

Tooling: Assembling the Stack

No single tool covers all five dimensions of AI observability. The production stack we recommend uses three layers. Langfuse handles LLM-specific observability: tracing individual requests through multi-step chains, capturing token usage, measuring generation latency, and scoring output quality. It is purpose-built for language model systems and understands concepts like prompt versioning, model comparison, and evaluation datasets that general-purpose tools do not.

Grafana handles dashboarding and alerting. It aggregates metrics from Langfuse and your custom middleware into unified dashboards that show cost, quality, latency, and error rates in a single view. Grafana's alerting engine supports the multi-threshold approach you need: different alert rules for different error categories, different cost thresholds for different models.

Custom logging middleware bridges the gap between your application layer and your observability tools. The middleware shown earlier in this article captures all five dimensions at the request level and ships them to both Langfuse (for LLM-specific analysis) and Prometheus (for Grafana dashboards). The middleware is the instrumentation layer. Langfuse and Grafana are the analysis layers.

The architecture is straightforward: every AI request passes through the middleware, which records metrics to your time-series database and sends traces to Langfuse. Grafana reads from the time-series database for dashboards and alerts. Langfuse provides the drill-down view when you need to investigate a specific request or compare prompt versions. Together, they give you both the altitude of a dashboard and the resolution of a trace.

Start with the middleware. You can swap dashboarding tools later, but if you are not capturing the right data at the request level from the beginning, no amount of tooling will compensate.

Key Takeaways

1. AI systems fail differently than traditional software. A 200 OK with hallucinated content is a failure that no APM tool will catch.

2. Monitor five dimensions: latency distribution by model and prompt, token usage and input/output ratios, automated output quality scoring, classified error rates, and real-time cost per request.

3. Track error categories separately. An API error rate and a malformed output rate require fundamentally different responses.

4. The observability stack for production AI is Langfuse for LLM tracing, Grafana for dashboards and alerts, and custom middleware to capture all five dimensions at the request level.

5. Start by instrumenting every request with a logging middleware. The data you capture today determines what you can diagnose tomorrow.
