
AI Cost Management: A Framework for Token Budgets at Scale

Model selection, caching, prompt optimization, and cost tracking.

The Prompt Engineering Project -- February 17, 2025 -- 11 min read

Quick Answer

AI API cost optimization uses techniques like semantic caching, intelligent model routing, prompt compression, and request batching to reduce LLM spending by 50-90%. Route simple queries to smaller, cheaper models and reserve expensive models for complex tasks. Cache frequent responses, shorten prompts without losing quality, and batch non-urgent requests. Most teams overspend because they use the most capable model for every request.

AI costs scale with usage. This sounds obvious, but the implications catch most teams off guard. Traditional software has fixed infrastructure costs that are largely decoupled from usage volume. A database query costs the same whether ten users or ten thousand users trigger it. AI is different. Every request consumes tokens. Every token has a price. And that price varies by model, by provider, and by the complexity of the task. Without deliberate cost management, AI budgets explode the moment a product finds traction.

This article presents a five-part framework for managing AI costs at scale: model selection, caching, prompt optimization, batch processing, and cost-per-output tracking. Each part addresses a different lever in the cost equation, and the compound effect of applying all five is dramatic -- often a 60 to 80 percent reduction in per-request cost without meaningful quality degradation.

Part One: Model Selection

The most expensive mistake in AI cost management is using the wrong model for the task. Teams default to the most capable model available -- GPT-4, Claude Opus, Gemini Ultra -- for every request, regardless of task complexity. This is the equivalent of using a diesel truck to deliver a letter. The task gets done, but the cost is absurd relative to the requirement.

Model selection should be driven by task complexity, not by habit or fear. Classification tasks -- routing a support ticket, detecting sentiment, categorizing content -- do not require frontier models. They require fast, cheap models that are good at pattern matching. Summarization, extraction, and formatting tasks need moderate capability. Only complex reasoning, multi-step analysis, and creative generation justify the cost of frontier models.

$0.25/1M -- Haiku / GPT-4o-mini
$3/1M -- Sonnet / GPT-4o
$15/1M -- Opus / GPT-4

The cost difference between tiers is not incremental. It is an order of magnitude. A classification task that costs $0.0003 with Haiku costs $0.018 with Opus -- sixty times more for a task where the output quality is indistinguishable. At a thousand requests per day, that difference is $0.30 versus $18. At a million requests per month, it is $300 versus $18,000. Model selection is the single highest-leverage cost optimization available.

Using a frontier model for a classification task is like chartering a 747 for a trip to the grocery store. The capability exists, but the economics make no sense.
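
A routing layer makes this policy explicit in code. The sketch below is minimal: the task categories, model names, and prices are illustrative placeholders, not a definitive mapping.

```typescript
// Minimal model-routing sketch. Task categories, model names, and
// prices are illustrative -- substitute whatever your stack uses.
type TaskType = 'classification' | 'extraction' | 'summarization' | 'reasoning' | 'generation'

interface ModelTier {
  model: string
  inputPricePer1M: number // USD per million input tokens (illustrative)
}

const TIERS: Record<'small' | 'medium' | 'frontier', ModelTier> = {
  small:    { model: 'claude-haiku',  inputPricePer1M: 0.25 },
  medium:   { model: 'claude-sonnet', inputPricePer1M: 3 },
  frontier: { model: 'claude-opus',   inputPricePer1M: 15 },
}

function selectModel(task: TaskType): ModelTier {
  switch (task) {
    case 'classification':
      return TIERS.small    // pattern matching: a cheap model suffices
    case 'extraction':
    case 'summarization':
      return TIERS.medium   // moderate capability needed
    case 'reasoning':
    case 'generation':
      return TIERS.frontier // only complex work justifies frontier pricing
  }
}
```

Routing on a declared task type is the simplest policy; more sophisticated routers classify the incoming request itself with a small model before dispatching.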

Part Two: Caching

Caching is the second-highest leverage optimization, and it is the most underused. The principle is straightforward: if you have sent the same prompt before and received a satisfactory response, return the cached response instead of making a new API call. The cost of the cached response is zero tokens.

Exact Match Caching

The simplest form of caching uses a hash of the full prompt as a cache key. If the exact prompt has been sent before, the cached response is returned immediately. This works well for deterministic tasks where the same input always produces the same desired output: classification, extraction, formatting, and translation.

exact-cache.ts
import { createHash } from 'crypto'

interface CacheEntry {
  response: string
  model: string
  timestamp: number
  ttl: number
}

class PromptCache {
  // In-memory store; a production system would back this with Redis or similar
  private store: Map<string, CacheEntry> = new Map()

  // Key on model + prompt so responses from different models never collide
  private hash(prompt: string, model: string): string {
    return createHash('sha256')
      .update(`${model}:${prompt}`)
      .digest('hex')
  }

  get(prompt: string, model: string): string | null {
    const key = this.hash(prompt, model)
    const entry = this.store.get(key)
    if (!entry) return null
    if (Date.now() - entry.timestamp > entry.ttl) {
      this.store.delete(key)
      return null
    }
    return entry.response
  }

  set(prompt: string, model: string, response: string, ttl = 86400000) { // default TTL: 24 hours
    const key = this.hash(prompt, model)
    this.store.set(key, {
      response, model,
      timestamp: Date.now(), ttl,
    })
  }
}
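
Wiring the cache in front of the API client can look like the following sketch; `callModel` is a hypothetical stand-in for your actual client call.

```typescript
// Hypothetical cache-first wrapper: consult the cache before spending
// tokens on an API call. `callModel` stands in for the real client.
async function cachedCompletion(
  cache: Map<string, string>,
  model: string,
  prompt: string,
  callModel: (model: string, prompt: string) => Promise<string>,
): Promise<{ response: string; cacheHit: boolean }> {
  const key = `${model}:${prompt}` // or a hash, as in PromptCache
  const hit = cache.get(key)
  if (hit !== undefined) {
    return { response: hit, cacheHit: true } // zero tokens spent
  }
  const response = await callModel(model, prompt)
  cache.set(key, response)
  return { response, cacheHit: false }
}
```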

Semantic Caching

Exact match caching misses the cases where prompts are similar but not identical. "Summarize this article in three sentences" and "Provide a three-sentence summary of this article" have the same intent but different cache keys. Semantic caching addresses this by embedding the prompt into a vector space and checking for similar prompts within a configurable distance threshold.

Semantic caching is more complex to implement and introduces a quality risk: the cached response for a similar-enough prompt might not be appropriate for the current prompt. The threshold must be tuned carefully. Too loose, and you return irrelevant cached responses. Too tight, and the cache never hits. In practice, a cosine similarity threshold of 0.95 or higher provides a reasonable balance for most use cases.

Semantic caching should never be applied to creative or reasoning-heavy tasks where subtle input differences produce meaningfully different outputs. Reserve it for classification, extraction, and formatting tasks with stable output expectations.
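
A semantic lookup can be sketched as a nearest-neighbor search over stored prompt embeddings. The embeddings here are plain number arrays; in practice they would come from an embedding model, and a production system would use a vector index rather than a linear scan.

```typescript
// Semantic cache lookup sketch: return a cached response only when a
// stored prompt embedding is within the similarity threshold.
interface SemanticEntry {
  embedding: number[]
  response: string
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

function semanticLookup(
  entries: SemanticEntry[],
  queryEmbedding: number[],
  threshold = 0.95, // tune per task: too loose returns irrelevant responses
): string | null {
  let best: { similarity: number; response: string } | null = null
  for (const entry of entries) {
    const similarity = cosineSimilarity(entry.embedding, queryEmbedding)
    if (similarity >= threshold && (best === null || similarity > best.similarity)) {
      best = { similarity, response: entry.response }
    }
  }
  return best === null ? null : best.response
}
```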

Part Three: Prompt Optimization

Every token in a prompt costs money. At scale, verbose prompts are expensive prompts. The goal of prompt optimization is to reduce token count while maintaining output quality -- and in many cases, shorter prompts actually improve quality by reducing noise and ambiguity.

The optimization process is systematic, not ad hoc. Start with a working prompt. Measure its output quality against an evaluation suite. Then remove content and measure again. If quality is maintained, the removed content was not contributing. If quality drops, restore the content and try removing something else.

Instead of

You are a helpful AI assistant. Your job is to take the following customer support ticket and classify it into one of the following categories. Please read the ticket carefully and think about which category best fits. The categories are: billing, technical, account, feature-request, other. Please respond with only the category name.

Try this

Classify this support ticket into exactly one category: billing, technical, account, feature-request, other. Respond with the category name only.

The verbose prompt uses 67 tokens. The optimized prompt uses 28 tokens. The output quality is identical for this task because classification does not benefit from preamble or encouragement. Across a million requests per month, that difference is 39 million tokens saved -- approximately $10 at Haiku pricing, $117 at Sonnet pricing, and $585 at Opus pricing. And this is a single prompt. A production system typically has dozens.
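
That arithmetic generalizes into a quick estimator. This is a back-of-envelope sketch; real token counts come from the provider's tokenizer, and prices are per million input tokens.

```typescript
// Estimate monthly dollars saved by a prompt rewrite.
function monthlySavings(
  verboseTokens: number,    // tokens in the original prompt
  optimizedTokens: number,  // tokens after optimization
  requestsPerMonth: number,
  inputPricePer1M: number,  // USD per million input tokens
): number {
  const tokensSaved = (verboseTokens - optimizedTokens) * requestsPerMonth
  return (tokensSaved / 1_000_000) * inputPricePer1M
}

// The example above: 67 -> 28 tokens, one million requests per month
// monthlySavings(67, 28, 1_000_000, 0.25) -> 9.75 (Haiku pricing)
```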

System prompts are the highest-leverage optimization target because they are prepended to every request. A 100-token reduction in the system prompt saves 100 tokens on every single API call.

Part Four: Batch Processing

Not every AI request needs to be processed in real time. Background tasks -- content classification, embedding generation, bulk summarization, data enrichment -- can be aggregated and processed in batches. Batch processing reduces cost in two ways: lower per-token pricing from providers who offer batch APIs, and reduced overhead from fewer API calls with larger payloads.

Anthropic and OpenAI both offer batch processing endpoints with significant discounts. Anthropic's batch API provides a 50 percent discount on token pricing with a 24-hour processing window. OpenAI's batch API offers similar economics. The tradeoff is latency for cost: batch results are not immediate, but for tasks that do not require real-time responses, the savings are substantial.

1. Identify batch candidates. Content enrichment, nightly summarization, weekly reports, and embedding updates are strong candidates.

2. Aggregate requests. Group similar tasks together to minimize prompt overhead per item.

3. Submit during off-peak hours. Most providers process batches faster during low-demand periods.

4. Implement result handling. Batch results arrive asynchronously and must be matched back to the original requests.
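
The aggregation step can be sketched as a simple grouping pass. The request shape is hypothetical, and the output is not any provider's actual batch payload format.

```typescript
// Hypothetical batch aggregation: group queued requests by task type so
// each batch can share a single prompt template.
interface QueuedRequest {
  id: string
  taskType: string // e.g. 'classify', 'summarize' (illustrative)
  input: string
}

function buildBatches(queue: QueuedRequest[]): Map<string, QueuedRequest[]> {
  const batches = new Map<string, QueuedRequest[]>()
  for (const request of queue) {
    const group = batches.get(request.taskType) ?? []
    group.push(request)
    batches.set(request.taskType, group)
  }
  return batches
}
```

Keeping the original `id` on every item is what makes result handling possible: asynchronous batch results can be matched back to the requests that produced them.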

The organizational challenge is often harder than the technical one. Teams accustomed to real-time AI responses resist the shift to batch processing because it changes the interaction model. The key is identifying which tasks genuinely need real-time responses and which merely default to real-time because nobody considered the alternative.

Part Five: Cost Per Output Tracking

The metric that matters is not cost per token. It is not cost per request. It is cost per useful output. This distinction reframes the entire cost optimization conversation. A request that costs $0.10 and produces a usable result on the first attempt is cheaper than a request that costs $0.01 but requires twenty retries before producing something acceptable.

cost-tracking.ts
interface CostMetrics {
  requestId: string
  model: string
  inputTokens: number
  outputTokens: number
  rawCost: number   // token cost for this request, including any retries
  attempts: number  // how many tries it took to get a usable result
  usable: boolean   // did this produce acceptable output?
}

function calculateEffectiveCost(metrics: CostMetrics[]): {
  totalSpend: number
  usableOutputs: number
  costPerUsableOutput: number
  wasteRate: number
} {
  const totalSpend = metrics.reduce((sum, m) => sum + m.rawCost, 0)
  const usableOutputs = metrics.filter((m) => m.usable).length
  const wastedSpend = metrics
    .filter((m) => !m.usable)
    .reduce((sum, m) => sum + m.rawCost, 0)

  return {
    totalSpend,
    usableOutputs,
    costPerUsableOutput: usableOutputs > 0
      ? totalSpend / usableOutputs
      : Infinity,
    wasteRate: totalSpend > 0
      ? wastedSpend / totalSpend
      : 0,
  }
}

The formula for effective cost per output is:

cost-formula.txt
Effective Cost Per Output = Total Token Spend / Number of Usable Outputs

Where:
  Total Token Spend = Sum of (input_tokens + output_tokens) * price_per_token
                      for ALL requests, including failed attempts and retries

  Usable Outputs   = Count of requests that produced acceptable results
                      without human correction

Example:
  100 requests at $0.03 each = $3.00 in first attempts
  80 produced usable output on the first attempt
  15 required one retry ($0.03 x 15 = $0.45 additional)
  5 failed entirely ($0.03 x 5 = $0.15 of that spend, wasted)

  Total spend: $3.00 + $0.45 = $3.45
  Usable outputs: 80 + 15 = 95
  Cost per usable output: $3.45 / 95 = $0.0363
  Waste rate: $0.15 / $3.45 = 4.3%

This metric exposes a counterintuitive truth: upgrading to a more expensive model can reduce effective cost per output. Suppose a cheap model succeeds 30 percent of the time, requiring an average of 3.3 attempts per usable output, while a model that costs three times as much succeeds 95 percent of the time with 1.05 attempts. The expensive model is cheaper per usable output -- roughly 3.15 times the base attempt cost versus 3.3 times -- and that is before counting the latency and review overhead that every failed attempt carries. The math depends on the specific prices and success rates, but the principle holds: raw token cost is a misleading metric when considered in isolation.

A $0.10 request that works on the first try is cheaper than a $0.01 request that needs twenty retries. Measure cost per usable output, not cost per token.
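
The break-even arithmetic can be made concrete. Assuming each attempt succeeds independently with a fixed probability, expected attempts per usable output is one over the success rate, so:

```typescript
// Expected cost per usable output, assuming independent attempts that
// each succeed with probability `successRate` and failures are retried.
function costPerUsableOutput(pricePerAttempt: number, successRate: number): number {
  if (successRate <= 0) return Infinity // never produces a usable output
  return pricePerAttempt / successRate  // expected attempts = 1 / successRate
}

// Illustrative prices: a 3x premium can still win when the cheap
// model's success rate is low enough.
const cheap = costPerUsableOutput(0.01, 0.3)      // ~$0.0333 per usable output
const expensive = costPerUsableOutput(0.03, 0.95) // ~$0.0316 per usable output
```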

The Compound Effect

These five strategies are not alternatives. They are layers. Apply model selection first to get the right model for each task. Apply caching to eliminate redundant requests. Optimize prompts to reduce per-request token count. Batch non-real-time tasks for volume discounts. And track cost per usable output to ensure your optimizations are improving the metric that actually matters.

The compound effect is significant. A team that applies all five layers typically sees a 60 to 80 percent reduction in AI costs compared to the naive approach of sending every request to a frontier model with verbose prompts and no caching. At scale, this is the difference between AI being a sustainable capability and AI being an unsustainable expense that gets cut in the next budget cycle.
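
The layering can be sketched as multiplicative reductions on the remaining cost. The rates below are hypothetical placeholders, not measurements.

```typescript
// Compound effect sketch: each layer removes a fraction of whatever
// cost remains after the previous layers.
function compoundedCost(baselineCost: number, reductionRates: number[]): number {
  return reductionRates.reduce((cost, rate) => cost * (1 - rate), baselineCost)
}

// Hypothetical rates: routing 40%, caching 25%, prompts 15%, batching 10%
// compoundedCost(100, [0.4, 0.25, 0.15, 0.1]) -> ~34.4, a ~66% reduction
```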


Cost management is not about spending less. It is about spending effectively. The goal is not the cheapest possible AI usage. The goal is the highest value per dollar spent. Sometimes that means using the most expensive model available because the task demands it. More often, it means using the cheapest model that meets the quality bar, caching aggressively, and measuring outcomes instead of inputs.

Key Takeaways

1. Model selection is the highest-leverage optimization. Match model capability to task complexity -- do not default to frontier models for classification tasks.

2. Exact match caching eliminates cost for repeated prompts. Semantic caching extends this to similar-enough prompts, but requires careful threshold tuning.

3. Prompt optimization reduces per-request token count. System prompts are the highest-leverage target because they are prepended to every request.

4. Batch processing offers 50 percent discounts from major providers. Any task that does not require real-time responses is a batch candidate.

5. Cost per usable output is the metric that matters. A more expensive model with higher success rates is often cheaper than a cheap model that requires retries.

6. Applied together, these five strategies typically reduce AI costs by 60 to 80 percent without meaningful quality degradation.

