
From Experiment to Production: An AI Operations Checklist

The gap between demo and deploy is wider than you think.

The Prompt Engineering Project March 25, 2025 12 min read

Quick Answer

AI production deployment requires bridging the gap between a working prototype and a reliable system. This involves building evaluation suites with quantitative metrics, implementing prompt versioning and rollback, adding monitoring for output quality and latency, handling model failures gracefully, managing cost at scale, and establishing human review workflows for high-stakes outputs.

The gap between a working prototype and a production system is where most AI projects fail. The demo works. The notebook runs. The stakeholders are excited. And then someone asks: what happens when the model is down? How do we roll back? What does this cost per request? Who debugs it at 2 AM? These questions are not edge cases. They are the requirements that separate experiments from systems.

This checklist is targeted at technical leaders responsible for moving AI features from proof-of-concept to production. Each item represents a failure mode we have observed in real deployments. None of them are optional. Skip one, and it will surface as an incident within the first quarter of operation.

1. Evaluation Framework

Define success metrics before you build. This sounds obvious, but the majority of AI projects we audit have no formal evaluation framework. The team knows the feature "works" because they tested it manually with a handful of inputs and the outputs looked reasonable. That is not evaluation. That is anecdote.

A production evaluation framework has three components. First, a dataset of representative inputs spanning the full range of expected use cases, including edge cases and adversarial inputs. Second, a set of metrics that map to business outcomes -- not just accuracy, but latency, cost, format compliance, safety, and user satisfaction. Third, an automated pipeline that runs the dataset against the current prompt and model configuration and produces a scorecard on every change.

eval-suite.ts
interface EvalCase {
  id: string
  input: string
  expectedOutput?: string       // For exact match
  criteria: EvalCriterion[]     // For scored evaluation
  tags: string[]                // For filtering and reporting
}

interface EvalCriterion {
  name: string                  // e.g., 'format_compliance'
  scorer: 'exact' | 'contains' | 'llm_judge' | 'regex' | 'custom'
  weight: number                // Relative importance
  threshold: number             // Minimum passing score (0-1)
}

// Run the suite on every prompt change
async function runEvalSuite(
  suite: EvalCase[],
  promptVersion: string,
  model: string
): Promise<EvalReport> {
  const results = await Promise.all(
    suite.map(testCase => evaluateCase(testCase, promptVersion, model))
  )
  return aggregateResults(results)
}
If you cannot articulate how you will measure success before you start building, you do not understand the problem well enough to build a production system. Go back to requirements.
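The `runEvalSuite` function above delegates to an `evaluateCase` helper. A minimal sketch of the scoring logic behind that helper, assuming a simple weighted-average design — the types are repeated from the suite definition so the sketch is self-contained, and the `llm_judge`, `regex`, and `custom` scorers are omitted:

```typescript
// Sketch of per-case scoring. Only 'exact' and 'contains' are
// implemented here; the other scorer types are left as stubs.
type Scorer = 'exact' | 'contains' | 'llm_judge' | 'regex' | 'custom'

interface EvalCriterion { name: string; scorer: Scorer; weight: number; threshold: number }

interface EvalCase {
  id: string
  input: string
  expectedOutput?: string
  criteria: EvalCriterion[]
  tags: string[]
}

function scoreCriterion(output: string, c: EvalCriterion, expected?: string): number {
  switch (c.scorer) {
    case 'exact':
      return output === expected ? 1 : 0
    case 'contains':
      return expected !== undefined && output.includes(expected) ? 1 : 0
    default:
      return 0 // llm_judge / regex / custom omitted in this sketch
  }
}

// Weighted score across criteria; a case passes only if every
// criterion clears its own threshold.
function scoreCase(output: string, testCase: EvalCase) {
  const totalWeight = testCase.criteria.reduce((sum, c) => sum + c.weight, 0)
  let weighted = 0
  let passed = true
  for (const c of testCase.criteria) {
    const s = scoreCriterion(output, c, testCase.expectedOutput)
    weighted += s * c.weight
    if (s < c.threshold) passed = false
  }
  return { id: testCase.id, score: weighted / totalWeight, passed }
}
```

Separating the per-criterion scorer from the aggregation makes it easy to add an LLM-judge scorer later without touching the pass/fail logic.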

2. Prompt Versioning

Prompts change. They change during development, they change during tuning, they change when models update, and they change when requirements shift. Without versioning, you cannot answer basic operational questions: which version of the prompt is currently in production? When did it change? What was the previous version? Did the change improve or degrade performance?

Treat prompts as code. Store them in version control. Use semantic versioning: major version for structural changes that alter output format, minor version for behavioral changes that alter output quality, patch version for typo fixes and clarifications. Tag every deployment with the prompt version and model version so you can correlate production behavior with specific configurations.

prompt-registry.yaml
prompts:
  customer-support-router:
    current: 2.3.1
    model: claude-sonnet-4-20250514
    deployed: 2025-03-15T14:30:00Z
    changelog:
      - version: 2.3.1
        date: 2025-03-15
        note: "Clarified escalation criteria for billing disputes"
      - version: 2.3.0
        date: 2025-03-10
        note: "Added sentiment detection to routing logic"
      - version: 2.2.0
        date: 2025-02-28
        note: "Expanded category taxonomy from 8 to 12 categories"

A prompt without a version number is a configuration without an audit trail. In production, that is unacceptable.
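In code, the registry above becomes a lookup that tags every outbound request with both the prompt version and the model version. A minimal sketch, with hypothetical names (`PromptConfig`, `requestTags`) standing in for whatever your config layer provides:

```typescript
// Sketch: resolve the deployed configuration for a prompt so every
// request can be tagged with an exact (promptVersion, model) pair.
interface PromptConfig {
  current: string
  model: string
  deployed: string
}

const registry: Record<string, PromptConfig> = {
  'customer-support-router': {
    current: '2.3.1',
    model: 'claude-sonnet-4-20250514',
    deployed: '2025-03-15T14:30:00Z',
  },
}

// Tag outbound requests so production behavior can be correlated
// with a specific prompt and model configuration.
function requestTags(promptId: string): { promptVersion: string; model: string } {
  const config = registry[promptId]
  if (!config) throw new Error(`Unknown prompt: ${promptId}`)
  return { promptVersion: config.current, model: config.model }
}
```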

3. Monitoring and Observability

AI systems fail differently than traditional software. A web server either responds or it does not. An AI system can respond with confident, well-formatted, completely wrong output. Your monitoring must account for this.

Instrument four dimensions. Latency: time from request to complete response, broken down by model inference time, tool execution time, and network overhead. Token usage: input tokens, output tokens, and total tokens per request, tracked by prompt version and model. Output quality: automated scoring of a sample of production outputs against your evaluation criteria. Error rates: not just HTTP errors, but semantic errors -- outputs that are syntactically valid but factually wrong, off-topic, or unsafe.

observability.ts
interface AIRequestMetrics {
  requestId: string
  promptVersion: string
  model: string
  inputTokens: number
  outputTokens: number
  latencyMs: number
  toolCalls: { name: string; durationMs: number }[]
  qualityScore?: number       // From async eval sampling
  errorType?: 'none' | 'timeout' | 'rate_limit' | 'model_error' | 'quality'
  costUsd: number
  timestamp: string
}

// Emit metrics after every request
function recordMetrics(metrics: AIRequestMetrics) {
  telemetry.emit('ai.request', metrics)
  if (metrics.latencyMs > LATENCY_THRESHOLD) {
    alerts.warn('ai.slow_request', metrics)
  }
  if (metrics.errorType && metrics.errorType !== 'none') {
    alerts.error('ai.request_error', metrics)
  }
}
Sample-based quality monitoring catches degradation before users report it. Run your evaluation suite against 5-10% of production traffic asynchronously. If scores drop below your threshold, alert before the support tickets arrive.
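The sampling loop described above can be sketched as follows. The scoring and alert hooks are passed in for illustration — they stand in for your evaluation pipeline and alerting system, and are not part of the article's code:

```typescript
// Sketch of sample-based quality monitoring: asynchronously score a
// fraction of production outputs and alert on degradation.
const SAMPLE_RATE = 0.05      // score ~5% of production traffic
const QUALITY_THRESHOLD = 0.8 // alert below this score

async function scoreAndAlert(
  requestId: string,
  input: string,
  output: string,
  scoreOutput: (input: string, output: string) => Promise<number>,
  alert: (event: string, data: unknown) => void
): Promise<number> {
  const score = await scoreOutput(input, output) // e.g., an LLM-judge eval
  if (score < QUALITY_THRESHOLD) {
    alert('ai.quality_degraded', { requestId, score })
  }
  return score
}

// Fire-and-forget sampling on the request path: never block the
// user-facing response on quality scoring.
function maybeSampleQuality(
  requestId: string,
  input: string,
  output: string,
  scoreOutput: (input: string, output: string) => Promise<number>,
  alert: (event: string, data: unknown) => void
): void {
  if (Math.random() >= SAMPLE_RATE) return
  void scoreAndAlert(requestId, input, output, scoreOutput, alert)
}
```

Keeping the scoring path asynchronous is the design point: quality monitoring adds zero latency to the user request.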

4. Cost Tracking

AI features have variable costs that scale with usage in ways traditional features do not. A database query costs fractions of a cent. A model inference can cost cents to dollars per request. At scale, the difference between a well-optimized prompt and a wasteful one is thousands of dollars per month.

Track cost per request, per feature, per user tier, and per model. Calculate the cost of every prompt change before deploying it -- if a new prompt adds 500 tokens of context, compute the monthly cost impact at current traffic volume. Implement caching for deterministic queries: if the same input always produces the same output, cache the output and skip the model call entirely. Use model routing to send simple requests to cheaper models and complex requests to more capable ones.

A prompt that adds 1,000 tokens of unnecessary context costs roughly $0.003 per request on Claude Sonnet. At 100,000 requests per day, that is $9,000 per month in wasted spend. Context optimization is cost optimization.
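The arithmetic behind that figure is worth making routine. A small sketch of the calculation, assuming a 30-day month and the $3-per-million-token input rate used above (check current pricing before relying on the constant):

```typescript
// Back-of-envelope cost model for a prompt change: added input tokens,
// priced per million, at current traffic volume.
const INPUT_PRICE_PER_MTOK = 3.0 // assumed input rate in USD; verify against current pricing

function monthlyCostImpact(addedTokens: number, requestsPerDay: number): number {
  const perRequest = (addedTokens / 1_000_000) * INPUT_PRICE_PER_MTOK
  return perRequest * requestsPerDay * 30 // assumes a 30-day month
}

// 1,000 extra tokens at 100,000 requests/day:
// monthlyCostImpact(1000, 100_000) → 9000 ($9,000/month)
```

Running this on every prompt change turns "the prompt got longer" into a dollar figure a reviewer can approve or reject.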

5. Fallback Strategies

Models go down. APIs get rate-limited. Outputs fail validation. Your system must handle all three gracefully. A fallback strategy defines what happens when the primary path fails, and it must be designed and tested before you need it.

Three-tier fallback is the minimum. Tier one: retry with exponential backoff for transient failures (timeouts, 429 responses, network errors). Tier two: fail over to an alternative model. If Claude is unavailable, route to GPT. If the primary model is rate-limited, use a smaller, faster model for degraded but functional output. Tier three: return a cached or static response. For classification tasks, return a default category. For generation tasks, return a template response that acknowledges the limitation.

fallback.ts
async function inferWithFallback(request: InferenceRequest): Promise<InferenceResult> {
  // Tier 1: Primary model with retry
  try {
    return await withRetry(
      () => callModel(request, { model: 'claude-sonnet-4-20250514' }),
      { maxRetries: 3, backoffMs: 1000 }
    )
  } catch (primaryError) {
    logger.warn('primary_model_failed', { error: primaryError })
  }

  // Tier 2: Alternative model
  try {
    return await callModel(request, { model: 'gpt-4o' })
  } catch (secondaryError) {
    logger.warn('secondary_model_failed', { error: secondaryError })
  }

  // Tier 3: Cached or static response
  const cached = await getCachedResponse(request.cacheKey)
  if (cached) {
    return { ...cached, source: 'cache', degraded: true }
  }

  // Final fallback: structured error
  return {
    output: null,
    error: 'ALL_MODELS_UNAVAILABLE',
    source: 'fallback',
    degraded: true,
  }
}

6. Security Review

AI features introduce security concerns that your existing security review process may not cover. Input validation must account for prompt injection: user input that attempts to override system instructions. Output sanitization must account for the model generating content that includes PII, copyrighted material, or executable code. Data handling must account for the fact that user inputs are sent to a third-party API.

Conduct a threat model specific to each AI feature. Identify what data flows to the model, what the model can access via tools, what the model output is used for downstream, and what happens if the model produces adversarial output. Document the threat model and review it when the feature changes.

If your AI feature can execute actions (send emails, modify data, trigger deployments), a prompt injection vulnerability is equivalent to a remote code execution vulnerability. Treat it with the same severity.
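One concrete mitigation is to constrain what the model can do, not just what it says: validate every model-requested tool call against a per-feature allowlist before executing it. A minimal sketch — the feature IDs, tool names, and `report` hook are illustrative:

```typescript
// Sketch: drop and report any tool call a feature is not explicitly
// permitted to execute. Unknown calls are never run.
const TOOL_ALLOWLIST: Record<string, Set<string>> = {
  'customer-support-router': new Set(['lookup_ticket', 'categorize']),
}

interface ToolCall { name: string; args: Record<string, unknown> }

function filterToolCalls(
  featureId: string,
  calls: ToolCall[],
  report: (call: ToolCall) => void
): ToolCall[] {
  const allowed = TOOL_ALLOWLIST[featureId] ?? new Set<string>()
  return calls.filter(call => {
    if (allowed.has(call.name)) return true
    report(call) // possible injection attempt: log and alert, do not execute
    return false
  })
}
```

An allowlist does not prevent injection, but it bounds the blast radius: an injected prompt can only invoke tools the feature was already trusted with.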

7. Documentation

AI features are harder to document than traditional features because their behavior is probabilistic. But this makes documentation more important, not less. Document three things for every AI feature.

First, the system prompt and its rationale. Every directive in the system prompt exists for a reason. Document those reasons so that future engineers know which lines are load-bearing and which can be modified. Second, the tool definitions and their expected behavior, including edge cases and known limitations. Third, runbooks for common failure modes: what to do when quality degrades, when costs spike, when the model produces unsafe output, when an upstream API goes down.

A system prompt without documentation is a configuration file without comments. It works until someone changes it, and then it breaks in ways nobody can diagnose.

8. Team Training

Who on your team can debug a prompt regression? Who understands why the system prompt is structured the way it is? Who knows how to interpret the evaluation scores? If the answer to any of these questions is one person, you have a bus factor problem.

AI operations require a new set of skills that most engineering teams do not have yet: prompt debugging, evaluation design, token optimization, model behavior analysis, and cost modeling. Invest in cross-training. Run prompt review sessions the same way you run code reviews. Pair junior engineers with senior prompt engineers on debugging tasks. Build a shared understanding of how the system works so that operational knowledge does not concentrate in a single person.

9. Rollback Plan

Every deployment must have a rollback plan that can be executed in under five minutes. For prompt changes, this means maintaining the previous prompt version in a deployable state. For model changes, this means keeping the previous model configuration available. For feature changes, this means a feature flag that can disable the AI feature entirely and fall back to the non-AI behavior.

rollback.ts
// Feature flag configuration
const AI_FEATURE_FLAGS = {
  'customer-support-router': {
    enabled: true,
    promptVersion: '2.3.1',
    model: 'claude-sonnet-4-20250514',
    rollbackVersion: '2.2.0',
    rollbackModel: 'claude-sonnet-4-20250514',
    fallbackBehavior: 'keyword-matching',  // Non-AI fallback
  },
}

// Rollback execution
async function rollback(featureId: string) {
  const config = AI_FEATURE_FLAGS[featureId]
  await updateDeployedConfig(featureId, {
    promptVersion: config.rollbackVersion,
    model: config.rollbackModel,
  })
  logger.critical('ai_rollback_executed', { featureId })
  alerts.page('ai_rollback', { featureId })
}

Test the rollback plan before you need it. Run a rollback drill quarterly. Verify that the rollback version still produces acceptable outputs. Verify that the non-AI fallback still works. A rollback plan that has never been tested is not a plan -- it is a hope.

10. Load Testing

AI features behave differently under load than traditional software features. Model APIs have rate limits that are lower than database connection limits. Token processing takes longer than most API calls. Concurrent requests can exhaust your API quota in minutes. Load testing must account for these characteristics.

Test at 2x your expected peak traffic. Measure not just throughput and latency, but token consumption rate, API error rates, and fallback activation frequency. Identify the breaking point: at what traffic level does the system start returning degraded responses? At what level does it start failing entirely? Use these numbers to set rate limits, configure auto-scaling, and plan capacity.

Model API rate limits are typically measured in requests per minute and tokens per minute. Your load test must track both. You can be well within your request limit and still hit the token limit if your prompts are large.
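Tracking both limits during a load test can be as simple as computing utilization per one-minute window. A sketch with illustrative limit values (substitute your provider's actual tier):

```typescript
// Sketch: per-minute utilization against both request and token limits.
// The limit values are placeholders, not any provider's real tier.
const LIMITS = { requestsPerMinute: 1000, tokensPerMinute: 400_000 }

interface MinuteWindow { requests: number; tokens: number }

function checkLimits(window: MinuteWindow) {
  return {
    requestUtilization: window.requests / LIMITS.requestsPerMinute,
    tokenUtilization: window.tokens / LIMITS.tokensPerMinute,
    // You can be under the request limit and still over the token
    // limit if average prompt size is large.
    throttled:
      window.requests > LIMITS.requestsPerMinute ||
      window.tokens > LIMITS.tokensPerMinute,
  }
}

// e.g., 500 req/min averaging 1,000 tokens each: 50% of the request
// limit but 125% of the token limit → throttled.
```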

This checklist is not comprehensive. Your specific deployment will have additional requirements driven by your domain, regulatory environment, and organizational constraints. But these ten items represent the minimum viable operational foundation for an AI feature in production. Skip any of them and you are not operating -- you are hoping.

Key Takeaways

1. Define evaluation metrics and build automated eval suites before writing the first prompt. You cannot improve what you do not measure.

2. Treat prompts as versioned, reviewed, tested code. Every production prompt needs a version number, a changelog, and a rollback path.

3. Monitor four dimensions: latency, token usage, output quality, and error rates. Traditional uptime monitoring is necessary but not sufficient.

4. Design three-tier fallback: retry, alternative model, static response. Test each tier independently and verify they activate correctly.

5. AI features introduce novel security concerns including prompt injection and data exfiltration. Conduct a threat model for every AI feature.
